VERDICT Accuracy Report
VERDICT is evaluated on two axes:
- Whether it surfaces known reportable activity when supported artifacts are parsed.
- Whether it refuses to overclaim when coverage is partial, single-source, or unsupported.
The second axis is as important as recall. A scoped INDETERMINATE is the correct
answer when evidence coverage is too thin to corroborate a stronger claim.
Scoring Harness
scripts/score-recall.py compares a completed run's verdict.json against an
answer key under goldens/<case-id>/expected-findings.json.
The scorer reports:
| Metric | Meaning |
|---|---|
| `expected_n` | Number of expected claims in the answer key. |
| `recalled_n` | Expected claims matched by run findings. |
| `recall_percent` | `recalled_n / expected_n`, rounded. |
| `verdict_match` | Whether the run Verdict is polarity-consistent with the answer key. |
| `pass` | `recall_percent` meets the case bar and the Verdict is consistent. |
Matching is intentionally conservative: the scorer uses distinctive token overlap and maximum bipartite matching so one verbose run Finding cannot satisfy several expected claims.
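The matching step can be sketched as follows. This is a minimal illustration, not the code in scripts/score-recall.py: the stopword list, the `MIN_COVERAGE` threshold, and every function name are assumptions made for the example. It scores each expected claim by the share of its tokens that a Finding covers, then runs Kuhn's augmenting-path algorithm so each Finding is matched to at most one expected claim.

```python
import re

# Hypothetical values; the real scorer's stopwords and threshold are not shown in this report.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "was", "is", "for", "with", "by", "from"}
MIN_COVERAGE = 0.5


def tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def coverage(expected, finding):
    """Fraction of the expected claim's tokens that also appear in the finding."""
    exp = tokens(expected)
    return len(exp & tokens(finding)) / len(exp) if exp else 0.0


def max_bipartite_matching(edges, n_findings):
    """Kuhn's algorithm. edges[i] lists the findings that may satisfy expected claim i."""
    owner = [None] * n_findings  # owner[j] = expected-claim index matched to finding j

    def try_assign(i, seen):
        for j in edges[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free finding, or re-route the claim that currently holds it.
            if owner[j] is None or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(edges)))


def score(expected_claims, findings, bar_percent):
    """Compute the recall metrics for one case (verdict_match is out of scope here)."""
    edges = [
        [j for j, f in enumerate(findings) if coverage(e, f) >= MIN_COVERAGE]
        for e in expected_claims
    ]
    expected_n = len(expected_claims)
    recalled_n = max_bipartite_matching(edges, len(findings))
    recall_percent = round(100 * recalled_n / expected_n) if expected_n else 0
    return {
        "expected_n": expected_n,
        "recalled_n": recalled_n,
        "recall_percent": recall_percent,
        "meets_bar": recall_percent >= bar_percent,
    }
```

Given two expected claims and a single verbose Finding that covers both, `score` reports a `recalled_n` of 1 rather than 2, because the Finding can be assigned to only one claim. The sketch leaves out `verdict_match`, so `meets_bar` is only the recall half of `pass`.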
Current Public Corpus
The repository ships small answer-key JSON files in goldens/. Large fixtures are
not committed; scripts/fetch-fixtures.sh stages public datasets into fixtures/
when the operator wants to run benchmark cases.
| Case | Artifact class | Purpose |
|---|---|---|
| `nitroba` | PCAP | Network-evidence recall without over-attribution. |
| `nist-hacking-case` | Disk | Hacking-tool execution and artifact-corroboration coverage. |
| `otrf-apt3-mordor` | Windows logs | EVTX/Sysmon/JSON correlation against OTRF's APT3 emulation telemetry. |
| `memlabs-lab1-memlabs-lab3` | Windows memory | Volatility-oriented memory extraction coverage using CTF-style objectives without committed flag values. |
| `digitalcorpora-lonewolf` | Windows disk + memory | Large Digital Corpora laptop scenario; records required artifacts and non-scored leads until an authorized teacher guide is available. |
| `synthetic-benign` | Synthetic control | False-positive floor: zero findings should remain `NO_EVIL`. |
| `sans-starter` | Mixed | SANS starter-case answer-key placeholder for local/eventual scoring. |
| Additional public cases | Disk, memory, Android, Linux | Regression corpus for parser expansion and confidence calibration. |
False Positives
False-positive handling is measured as a first-class outcome, not treated as an afterthought. The current controls are:
- `synthetic-benign` expects zero Findings and a scoped `NO_EVIL` verdict.
- `alihadi-09-encrypt` is the explicit false-positive control: encryption tools can be present without proving malicious activity, so the expected verdict is `INDETERMINATE`; an overconfident `SUSPICIOUS` / legacy `CONFIRMED_EVIL` result fails the scorer.
- Report QA blocks unsupported execution and exfiltration wording. Network-only activity, Amcache-only evidence, ShimCache-only evidence, memory-only process evidence, YARA-only hits, Hayabusa-only hits, and malfind-only hits remain leads unless corroborated by the required artifact classes.
- The committed EVTX execution trace in `release-evidence/evtx-security-log-clear-trace-summary.json` records one confirmed Security EID 1102 log-clear Finding and keeps unrelated ATT&CK blind spots as warnings rather than negative claims.
The release packet does not claim a global precision score. Precision is reported per scored case when a trustworthy answer key and completed run output exist.
Missed Artifacts
Misses are documented explicitly so partial coverage cannot be mistaken for clearance:
- `nist-hacking-case` currently recalls 7/14 expected claims (50%), below the 71% bar. The unmatched IDs are `nhc-001`, `nhc-002`, `nhc-003`, `nhc-006`, `nhc-012`, `nhc-013`, and `nhc-014`, covering artifacts not yet parsed in the committed run: ACMru/search history, USB history, email carving, `index.dat`/browser history, XP `.evt`, thumbcache, and named-pipe enum.
- Large or gated datasets remain marked `staged, run pending evidence` until the exact fixture is available and scored. No recall number is fabricated for those rows.
- Every live run writes `coverage_manifest.json`, which records each artifact class as parsed, failed, unsupported, or not supplied. The EVTX trace summary records four not-supplied classes (`disk/filesystem`, `memory`, `network`, and `velociraptor`) so reviewers can see what was outside the run scope.
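As an illustration of how a reviewer might read such a manifest, the sketch below groups artifact classes by every status other than parsed. The JSON layout (an `artifact_classes` mapping), the status spellings, the `evtx` class label, and the function names are assumptions for this example; the actual `coverage_manifest.json` schema is not reproduced in this report.

```python
# Assumed status vocabulary, derived from the four states named above.
STATUSES = {"parsed", "failed", "unsupported", "not_supplied"}


def coverage_gaps(manifest):
    """Group artifact classes by every status other than 'parsed'."""
    gaps = {}
    for artifact_class, status in manifest["artifact_classes"].items():
        if status not in STATUSES:
            raise ValueError(f"unknown status {status!r} for {artifact_class}")
        if status != "parsed":
            gaps.setdefault(status, []).append(artifact_class)
    return gaps


def may_report_clean(manifest, artifact_class):
    """Only a parsed class can support a clean statement; anything else is a limitation."""
    return manifest["artifact_classes"].get(artifact_class) == "parsed"


# Hypothetical manifest mirroring the four not-supplied classes in the EVTX trace summary.
example = {
    "artifact_classes": {
        "evtx": "parsed",
        "disk/filesystem": "not_supplied",
        "memory": "not_supplied",
        "network": "not_supplied",
        "velociraptor": "not_supplied",
    }
}

print(coverage_gaps(example))
```

A class that is missing from the manifest is treated the same as one that was not parsed: `may_report_clean` returns `False`, so absence of data never reads as clearance.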
Hallucinated Claims Found During Testing
The main hallucination class found during testing was not invented IOC text; it was overclaiming from thin evidence. The controls and observed fixes are:
- The first Nitroba network run returned `NO_EVIL` with 0 Findings because the packet cap hid late-case traffic and a truncated final packet caused useful stdout to be discarded. The fix raised the packet cap, tolerated partial tshark output when stdout is usable, added anonymous-email/cookie timeline extraction, and changed the judge grouping so one `pcap_triage` call can produce multiple distinct claims.
- The scorer was hardened from symmetric Jaccard matching to expected-coverage plus maximum bipartite matching, so a verbose broad Finding cannot satisfy multiple expected claims and match order cannot inflate recall.
- Findings without a current-case `tool_call_id` are vetoed. A claim whose cited tool output cannot be replayed or whose hash drifts is rejected or downgraded before it reaches the final report.
- Prompt-based guidance is not trusted as the final defense. If a model or operator wording tries to claim execution from a single weak artifact, report QA and the correlator keep that claim at `HYPOTHESIS`, downgrade it, or block customer-ready output.
No current release packet includes a hallucinated, uncited Finding as a valid
Finding. When uncertain coverage remains, it is represented as a warning,
limitation, contradiction, or HYPOTHESIS instead of a confirmed claim.
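The citation and replay controls described in this section can be sketched as a single review function. This is an illustrative outline under stated assumptions, not the verifier's code: the Finding shape, the `recorded_hashes` mapping, the `replay` callable, and the three return labels are hypothetical, and the sketch always rejects where the real pipeline may instead downgrade.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of raw tool output."""
    return hashlib.sha256(data).hexdigest()


def review_finding(finding, recorded_hashes, replay):
    """Return 'veto', 'reject', or 'accept' for one Finding.

    recorded_hashes: current-case tool_call_id -> output hash recorded at run time.
    replay: callable that re-runs a tool call and returns its raw output bytes.
    """
    call_id = finding.get("tool_call_id")
    if call_id not in recorded_hashes:
        return "veto"  # uncited, or cited from outside the current case
    try:
        replayed = replay(call_id)
    except Exception:
        return "reject"  # the cited output cannot be replayed
    if sha256_hex(replayed) != recorded_hashes[call_id]:
        return "reject"  # replay hash drift
    return "accept"
```

The check does not depend on how the Finding is worded: a Finding with no `tool_call_id` is vetoed before any replay is attempted.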
Stage Two Adversarial Checks
Stage Two review is treated as hostile trace review, not as a demo-narrative exercise. The checks we expect judges to run are:
- False positives found: `alihadi-09-encrypt` remains an explicit control for benign or dual-use encryption-tool presence. The correct answer is scoped `INDETERMINATE`, not a confident suspicious verdict from tool presence alone.
- Missed artifacts: the public NIST Hacking Case score is still 7/14 recall. The missing ACMru/search history, USB history, email carving, browser history, XP `.evt`, thumbcache, and named-pipe artifacts are published as misses rather than hidden behind a broad accuracy claim.
- Hallucination and overclaim classes caught: uncited Findings, replay hash drift, unsupported execution wording, single-source execution claims, and unsupported exfiltration claims are vetoed, downgraded, or held as warnings by verifier/report-QA/correlator controls before release material is considered.
- Three-claim trace methodology: pick any three Findings from a report and trace each one to `finding_approved.tool_call_id`, the matching `tool_call_start`, its `tool_call_output.output_hash`, verifier replay records, and the manifest verification result. The committed EVTX packet is a compact public example of this method.
- Self-correction limitation: the clean Stage Two packet is traceability evidence with `fault_injection=0`. If a clean run has no organic runtime failure, it must not be described as organic self-correction. The injected verifier re-dispatch run is optional harness/demo evidence only.
Evidence Integrity
Evidence integrity is enforced architecturally rather than only by prompt text:
- `case_open` SHA-256s the evidence at the start of the Case.
- The product MCP surface has no `execute_shell` and no write verb for evidence. Evidence tools open source artifacts read-only; hardened deployments should also use a read-only mount / filesystem permissions.
- Each tool output is hashed into `audit.jsonl`, and each audit record links to the previous record through `prev_hash`.
- `manifest_finalize` seals the run with a Merkle root over canonical tool outputs plus a signature; `manifest_verify` replays the audit chain, leaf count, Merkle root, and signature offline.
- Every reportable Finding cites a current-case `tool_call_id`. The verifier re-runs the cited tool call and compares the replay output SHA-256 before the judge consumes the Finding.
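The `prev_hash` linkage can be illustrated with a small hash chain. The record fields other than `prev_hash`, the canonical-JSON hashing, the all-zero genesis value, and the function names are assumptions for this sketch; the real `audit.jsonl` layout may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel for the first record's prev_hash


def record_hash(record):
    """SHA-256 over a canonical JSON encoding of one audit record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def append_record(chain, tool_call_id, output: bytes):
    """Append a record that commits to the tool output and to the previous record."""
    prev = record_hash(chain[-1]) if chain else GENESIS
    chain.append({
        "tool_call_id": tool_call_id,
        "output_hash": hashlib.sha256(output).hexdigest(),
        "prev_hash": prev,
    })


def verify_chain(chain):
    """Return the index of the first broken link, or None if the chain is intact."""
    prev = GENESIS
    for i, record in enumerate(chain):
        if record["prev_hash"] != prev:
            return i
        prev = record_hash(record)
    return None
```

Editing any record other than the last breaks the link held by its successor. Linkage alone cannot detect a change to the final record, which is one reason a separate seal over the whole run is useful.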
If a prompt-based restriction is ignored, the architectural controls still limit
the damage: there is no raw shell tool to mutate evidence, no uncited Finding can
pass schema/verifier checks, single-source execution wording is blocked by
policy, and coverage_manifest.json prevents unsupported areas from being
reported as clean.
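For readers unfamiliar with the sealing step, a Merkle root over tool outputs can be computed as in the sketch below. It is a generic construction, not the `manifest_finalize` implementation: the leaf and node prefixes, the duplicate-last-node padding for odd levels, the empty-input value, and the function name are all assumptions, and the signature step is omitted.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(outputs):
    """Hex Merkle root over a list of canonical tool outputs (bytes), in order."""
    if not outputs:
        return _h(b"").hex()  # assumed convention for an empty run
    # Distinct prefixes keep a leaf hash from being confused with an inner node hash.
    level = [_h(b"\x00" + output) for output in outputs]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed padding: duplicate the last node
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing, reordering, or dropping any output changes the root, so an offline verifier that recomputes it from the recorded outputs can detect tampering without trusting the original run.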
Reproduce A Score
    bash scripts/fetch-fixtures.sh
    scripts/verdict fixtures/<case-path> --no-dashboard
    python scripts/score-recall.py tmp/auto-runs/<case-id> --golden goldens/<case-id>
For day-to-day development, run focused smokes first:
    python scripts/verdict-policy-smoke.py
    python scripts/report-policy-smoke.py
    python scripts/path-existence-smoke.py
    bash scripts/run-all-smokes.sh
Calibration Rules
- Execution claims require at least two current-case artifact classes.
- Amcache, ShimCache, memory-only process evidence, YARA, Hayabusa, or malfind alone is not enough for a confirmed execution claim.
- Network-only activity can surface leads, but it does not identify a human actor.
- Parser failure is a coverage limitation, not evidence of absence.
- Unsupported raw disk coverage must remain custody-only until supported artifacts are mounted or extracted.
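The first rule can be expressed as a small gate on the number of distinct artifact classes behind a claim. This is a simplified sketch: the class labels, the function name, and the `ELIGIBLE_FOR_CONFIRMATION` label are hypothetical, and the real correlator may additionally require specific class combinations.

```python
def execution_claim_level(supporting_classes):
    """Classify an execution claim by how many distinct artifact classes back it.

    supporting_classes: labels of current-case artifact classes citing the claim,
    for example ["amcache", "evtx"]. The labels are illustrative only.
    """
    distinct = {label.strip().lower() for label in supporting_classes}
    if len(distinct) >= 2:
        return "ELIGIBLE_FOR_CONFIRMATION"
    # Zero or one class: the claim stays a lead, whichever single source it is.
    return "HYPOTHESIS"


print(execution_claim_level(["amcache"]))
print(execution_claim_level(["amcache", "evtx"]))
```

Counting distinct classes means repeated hits from the same source, such as several Amcache entries, still count as one class and stay at `HYPOTHESIS`.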
Known Limits
- The public source tree does not ship bulky completed case directories or raw evidence. Operators produce fresh `tmp/auto-runs/<case-id>/` artifacts locally.
- Some benchmark fixtures require gated or large downloads and may need manual staging before scoring.
- Accuracy should be reported per case and per artifact class, not as a broad product-wide clean-bill statement.
Related docs: DATASET.md, false-positives.md,
cryptographic-attestation.md, and
live-test-matrix.md.