Dataset Documentation¶
What the agent was tested against, the source of each dataset, and what it found.
This document covers every fixture VERDICT was tested against. All fixtures are either public domain, permissively licensed, or pulled from SANS's own starter case data. None are bundled in the git tree; scripts/fetch-fixtures.sh pulls them at CI time.
Primary golden: SANS starter case data¶
| Attribute | Value |
|---|---|
| Source | SANS official starter case data |
| URL | https://sansorg.egnyte.com/fl/HhH7crTYT4JK |
| License | Distributed as starter case data by SANS Institute |
| Content | Sample disk images + memory captures |
| Purpose | Intended primary L3 golden-run fixture. Still a pending stub: goldens/sans-starter/expected-findings.json has no findings enumerated and has never been run or scored. |
| SHA-256 | Required as SANS_STARTER_SHA256 when SANS_STARTER_URL is set; recorded in fixtures/sha256sums.txt |
| Expected findings | (enumerated in goldens/sans-starter/expected-findings.json after first manual walk-through) |
Rationale for primary status: This dataset is published by SANS and is widely familiar to working DFIR practitioners. Optimizing for it aligns our accuracy metrics with a recognized reference baseline.
Secondary: NIST CFReDS Hacking Case¶
| Attribute | Value |
|---|---|
| Source | NIST Computer Forensics Reference Data Sets |
| URL | https://cfreds.nist.gov/all/NIST/HackingCase |
| License | Public domain (17 USC 105 — U.S. government works are not copyrightable) |
| Content | EnCase E01 (~4.5 GB compressed / ~4.8 GB raw NTFS); Windows host evidence |
| Purpose | Canonical DFIR benchmark case; industry-standard ground truth |
| SHA-256 | (recorded on first pull) |
| Expected findings | 14 canonical findings — enumerated in goldens/nist-hacking-case/expected-findings.json |
| Expected VERDICT top-line | SUSPICIOUS when corroborated; older goldens may still use CONFIRMED_EVIL as a scoring label |
Rationale: NIST's authority makes this a standard reference. Multiple DFIR tools publish accuracy against it, so our DFIR-Metric score is directly comparable to any competitor.
Lightweight extract — single Security.evtx for fast smoke¶
For developer-laptop iteration we don't always want the 4.5 GB E01. scripts/fetch-nist-fixture.sh pulls one small Security.evtx at fixtures/single-evtx/Security.evtx, used by python scripts/rust-mcp-smoke.py --real-evidence. Source URL is intentionally NOT hardcoded — set via env vars so operators can point at a vetted mirror without an upstream URL change breaking CI:
NIST_FIXTURE_URL=https://example.org/path/to/Security.evtx \
NIST_FIXTURE_SHA256=<64-hex-digits> \
bash scripts/fetch-nist-fixture.sh
Vetted candidate sources (any one is sufficient):
- An OTRF Security-Datasets sample with a single standalone .evtx payload (the datasets/atomic/windows/credential_access and datasets/atomic/windows/defense_evasion subtrees ship sub-MB EVTX files).
- An internal team mirror of CFReDS Hacking Case Security.evtx extracted via The Sleuth Kit's fls+icat from SCHARDT.001.
- A small synthetic EVTX produced by wevtutil epl on a clean Win10 host.
The fetch script is deliberately strict: SHA pin enforced when supplied, magic-byte sanity check (ElfFile\0) on every download, atomic rename, provenance recorded at fixtures/single-evtx/PROVENANCE.txt. The smoke harness skips silently when the fixture is absent so offline runs still pass.
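The checks described above can be sketched in Python. This is an illustration only, not the contents of scripts/fetch-nist-fixture.sh: the function name stage_fixture and its parameters are hypothetical, and provenance recording is left out.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional

EVTX_MAGIC = b"ElfFile\x00"  # 8-byte signature at the start of an EVTX file


def stage_fixture(payload: bytes, dest: Path, expected_sha256: Optional[str] = None) -> str:
    """Validate downloaded bytes, then move them into place atomically.

    Hypothetical helper: returns the SHA-256 hex digest of the payload.
    """
    if not payload.startswith(EVTX_MAGIC):
        raise ValueError("not an EVTX file: bad magic bytes")
    digest = hashlib.sha256(payload).hexdigest()
    # The pin is enforced only when one is supplied.
    if expected_sha256 and digest != expected_sha256.lower():
        raise ValueError(f"SHA-256 mismatch: got {digest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write next to the destination so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        os.replace(tmp_name, dest)  # atomic rename over the final path
    except BaseException:
        os.unlink(tmp_name)
        raise
    return digest
```

Validating before the rename means a rejected download never appears at the final path, so a later run cannot pick up a half-written or wrong file.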
Secondary: OTRF Security-Datasets (formerly Mordor)¶
| Attribute | Value |
|---|---|
| Source | Open Threat Research Forge |
| URL | https://github.com/OTRF/Security-Datasets |
| License | MIT License |
| Content | EVTX / JSON / Zeek replay datasets for specific attack scenarios (APT3, APT29, Empire, Covenant, Cobalt Strike) |
| Purpose | Behavior-specific validation; exercises Hayabusa Sigma rules and event-correlation paths |
| SHA-256 | (per-dataset, recorded on pull) |
| Expected findings | Per-dataset; scenario-specific (e.g., APT3-Mordor expects lateral movement T1021.006) |
| Verdict | Varies per dataset |
Rationale: Each dataset isolates a named attack pattern, so Hayabusa rule coverage can be validated precisely. Used in L2 smoke tests (non-blocking advisory) and L3 matrix runs.
Secondary: Volatility Foundation Memory Samples¶
| Attribute | Value |
|---|---|
| Source | Volatility Foundation |
| URL | https://github.com/volatilityfoundation/volatility/wiki/Memory-Samples |
| License | Creative Commons Attribution (CC-BY) — redistribute with attribution |
| Content | Known-good + known-malicious memory dumps (Cridex, Stuxnet, SpyEye samples, etc.) |
| Purpose | Volatility3 plugin validation; exercises vol_pslist, vol_malfind, cross-artifact memory→disk correlation |
| SHA-256 | (per-sample) |
| Expected findings | Per-sample (e.g., Cridex: injected PID list, malfind RWX regions) |
| Verdict | Varies per sample |
Rationale: Memory-specific ground truth. Windows profile auto-detection tested here before L3 runs.
Synthetic benign baseline¶
| Attribute | Value |
|---|---|
| Source | Internal, synthetic (generated by the build process) |
| URL | (not applicable — produced at CI time) |
| License | MIT (our own generation script) |
| Content | Clean Windows 10 install, patched, no tradecraft, representative baseline activity only |
| Purpose | Negative control — the agent must NOT produce false-positive findings |
| SHA-256 | (per-generation) |
| Expected findings | 0 (verdict: NO_EVIL) |
| Verdict | NO_EVIL |
Rationale: A tool that only finds evil on evil data is useless. This fixture verifies that the agent distinguishes benign systems from compromised ones, which addresses the "hallucination" criticism that Valhuntir explicitly warns about but does not measure.
Tiny regression fixture matrix¶
These are source-controlled smoke fixtures, not bundled evidence images. They keep
the final automation gates runnable on a laptop while larger public datasets stay
ignored under fixtures/ and goldens/.
Run the matrix with:
| Scenario | Tiny fixture or smoke input | Locked behavior |
|---|---|---|
| Benign | Synthetic benign EVTX rows in scripts/verdict-policy-smoke.py | Parsed benign rows produce zero Findings and scoped NO_EVIL. |
| EVTX-only | Synthetic Security EID 4698 scheduled-task row | Suspicious task creation emits one cited HYPOTHESIS Finding and remains INDETERMINATE. |
| Memory DKOM | Synthetic pslist / psscan divergence | Process-view divergence requires psxview follow-up and remains evidence-scoped. |
| Memory injection | Synthetic malfind RWX/MZ observable | Injection triage stays HYPOTHESIS and cites the malfind tool call. |
| Custody-only disk | Synthetic E01 case_open-only observable | Disk custody registration alone stays INDETERMINATE and does not mark disk contents touched. |
| Extracted-disk persistence | Synthetic extracted Prefetch plus Registry artifacts | Extracted disk artifacts dispatch to prefetch_parse and registry_query. |
| Network-only | Synthetic PCAP-only execution-overclaim QA packet | Report QA blocks network-only execution wording. |
| Velociraptor zip | Synthetic Velociraptor zip member inventory with contained Prefetch | Safe contained artifacts extract and dispatch to typed parsers. |
| Mixed full-case | Synthetic directory containing memory, EVTX, raw disk, and extracted disk artifacts | Mixed inventories mark supplied classes touched and can produce scoped NO_EVIL only after substantive parsers run. |
The matrix deliberately avoids fake production evidence and fake malicious demo findings. It verifies policy behavior, dispatch coverage, and overclaim blockers using tiny synthetic inputs; real-evidence accuracy still belongs to the public goldens above and larger ignored fixtures.
DFIR-Metric benchmark suite¶
| Attribute | Value |
|---|---|
| Source | DFIR-Metric research project |
| URL | https://github.com/DFIR-Metric |
| Paper | https://arxiv.org/abs/2505.19973 |
| License | (per repo — permissive, verified at Week 6) |
| Content | 700 MCQs + 150 CTF tasks + 500 NIST cases, designed to evaluate LLMs on DFIR |
| Purpose | Standardized accuracy metric; external validation of agent quality |
| SHA-256 | (per benchmark release) |
| Expected findings | (per case in the benchmark — documented by DFIR-Metric, not by us) |
| Verdict | Scored per DFIR-Metric rubric |
Rationale: The only public DFIR-specific benchmark. Publishing our score here (via the M1 leaderboard) differentiates us from Valhuntir, which explicitly declines to publish any accuracy metric.
DFRWS Rodeo and USB challenges¶
| Attribute | Value |
|---|---|
| Source | Digital Forensic Research Workshop, hosted on NIST CFReDS |
| URL | https://cfreds.nist.gov/ |
| License | Public domain |
| Content | Small USB DD images (~500 MB each) with deliberate artifacts |
| Purpose | Fast smoke tests in L1/L2 (images small enough to cache in CI) |
| SHA-256 | (per-image) |
| Expected findings | Per-challenge (documented per-case) |
| Verdict | Varies |
Rationale: Small size + public domain = ideal for L1/L2 rapid iteration where a full SIFT VM isn't needed.
Fixture caching and integrity¶
All fixtures are fetched by scripts/fetch-fixtures.sh (Spec #3 Task 10). On first pull, each file's SHA-256 is computed and recorded in fixtures/sha256sums.txt. Subsequent runs verify the checksum; a mismatch aborts with a clear error. This prevents a fixture swap from silently altering benchmark scores.
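The record-on-first-pull, verify-afterwards behavior can be sketched as follows. This is not the fetch script itself: the helper names are hypothetical, and the sketch assumes fixtures/sha256sums.txt uses the sha256sum-style line format (digest, two spaces, file name), which the source does not specify.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large images never load into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def record_or_verify(fixture: Path, sums_file: Path) -> str:
    """Return "recorded" on first pull and "verified" afterwards; raise on mismatch."""
    known = {}
    if sums_file.exists():
        for line in sums_file.read_text().splitlines():
            if line.strip():
                digest, name = line.split(maxsplit=1)  # assumed "digest  name" layout
                known[name] = digest
    actual = sha256_of(fixture)
    name = fixture.name
    if name not in known:
        with sums_file.open("a") as out:
            out.write(f"{actual}  {name}\n")
        return "recorded"
    if known[name] != actual:
        raise RuntimeError(
            f"checksum mismatch for {name}: expected {known[name]}, got {actual}"
        )
    return "verified"
```

Because the first pull is trusted as-is, this pattern only detects changes after that point; a pinned hash supplied up front closes that gap.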
Storage policy:
- Never committed to git. .gitignore excludes *.E01, *.ova, *.raw, *.mem, *.dd, *.aff, *.aff4.
- Not bundled in the release archive. Fixture URLs documented here; operators fetch via scripts/fetch-fixtures.sh.
- Cached in GHA via actions/cache keyed on fixtures/sha256sums.txt hash.
Public DFIR benchmark suite (one scenario per artifact class)¶
Public and candidate benchmark datasets are onboarded so we can do live
runs against every DFIR artifact class. Each has an answer-key file at
goldens/<case-id>/expected-findings.json, is fetched by scripts/fetch-fixtures.sh
(§6), and is scored offline by scripts/score-recall.py (recall vs min_recall_percent
plus honest verdict consistency). Run + score loop:
bash scripts/fetch-fixtures.sh # stage evidence (env vars below)
bash scripts/verdict fixtures/<case-id>/<evidence> # or --sift for disk classes
python scripts/score-recall.py tmp/auto-runs/<case-id> # recall vs golden
Tier is the thread's data-quality ranking, recorded as a caveat only — per project decision it is NOT a scoring gate (training-data contamination is not modeled). Tiers: 🟢 score against (trustworthy) · 🟡 build/test, score with care (answers gated) · 🟠 practice only (solutions public — likely in model training data) · 🔴 not ready.
| # | Case id | Class | Tier | Fetch | Expected outcome | Recall target |
|---|---|---|---|---|---|---|
| 1 | nitroba | network (pcap) | 🟢 | NITROBA_URL (default digitalcorpora) | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 80% |
| 2 | nist-data-leakage | disk (insider exfil) | 🟢 | DATA_LEAKAGE_URL + DATA_LEAKAGE_SHA256 | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 60% |
| 3 | nist-hacking-case | disk (XP) | 🟢 | default cfreds URL (already wired) | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 71% |
| 4 | otrf-apt3-mordor | Windows logs (EVTX/Sysmon/JSON) | 🟢 | sparse clone from OTRF Security-Datasets | SUSPICIOUS | 60% |
| 5 | memlabs-lab1 | Windows memory | 🟡 | MEMLABS_LAB1_URL + MEMLABS_LAB1_SHA256 (extracted memory dump direct URL or file://) | SUSPICIOUS | 67% |
| 6 | memlabs-lab2 | Windows memory | 🟡 | MEMLABS_LAB2_URL + MEMLABS_LAB2_SHA256 (extracted memory dump direct URL or file://) | SUSPICIOUS | 67% |
| 7 | memlabs-lab3 | Windows memory | 🟡 | MEMLABS_LAB3_URL + MEMLABS_LAB3_SHA256 (extracted memory dump direct URL or file://) | SUSPICIOUS | 67% |
| 8 | digitalcorpora-lonewolf | Windows disk + memory | 🟡 | LONEWOLF_URL + LONEWOLF_SHA256 (large full Digital Corpora bundle) | INDETERMINATE candidate | 0% |
| 9 | alihadi-09-encrypt | disk (crypto) | 🟡 | ALIHADI09_URL | INDETERMINATE (false-positive control) | 50% |
| 10 | alihadi-01-webserver | disk + memory | 🟡 | ALIHADI01_URL | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 60% |
| 11 | dfrws-2008-linux | memory+disk+network | 🟡 | pinned git clone (DFRWS2008_REF) | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 50% |
| 12 | m57-jean | disk/email | 🟠 | M57_JEAN_URL (default digitalcorpora) | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 60% |
| 13 | alihadi-07-sysinternals | disk (E01) | 🟠 | ALIHADI07_URL | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 50% |
| 14 | dfrws-2011-android | mobile/disk | 🔴 | DFRWS2011_URL | UNKNOWN (stub) | 40% |
| 15 | volatility-cridex | memory | 🔴 (sourcing) | CRIDEX_URL (canonical link dead) | SUSPICIOUS (legacy golden label: CONFIRMED_EVIL) | 50% |
Notable cases
- Windows-focused golden expansion. The Windows-heavy lane now covers logs
(otrf-apt3-mordor), memory (memlabs-lab1 through memlabs-lab3), and combined
disk+memory (digitalcorpora-lonewolf) in addition to the existing NIST and Ali Hadi
disk images. This is intentionally metadata/answer-key only; raw evidence remains under
fixtures/ when staged locally and is never committed.
- alihadi-09-encrypt is the false-positive control. Encryption tooling is present
but its presence is not proof of malice. The golden verdict is INDETERMINATE; a run
that escalates to SUSPICIOUS (or the legacy scoring label CONFIRMED_EVIL) FAILS the asymmetric verdict-match check
in score-recall.py. Findings are intentionally INFERRED/HYPOTHESIS.
- dfrws-2011-android TRAP: the upstream README hashes are labeled MD5 but are
actually SHA1 — do not chase a phantom mismatch. Evidence is on a personal Dropbox
that may vanish; mirror and recompute MD5+SHA256 before relying on it. The golden is a
stub (verdict UNKNOWN) pending a verified mirror + manual walkthrough.
- volatility-cridex sourcing is dead: the canonical
downloads.volatilityfoundation.org link no longer serves the image. Set CRIDEX_URL
to a verified mirror (a SANS-hosted copy with published hashes was requested in the
thread). The IOCs themselves are canonical (reader_sl.exe ← explorer.exe, malfind
injection, C2).
- otrf-apt3-mordor is the strongest Windows log expansion. It comes from OTRF
Security-Datasets' compound Windows APT3 telemetry and MITRE ATT&CK Evaluations Round 1
emulation material. scripts/fetch-fixtures.sh sparse-clones the compound APT3 tree plus
focused atomic Windows credential-access, defense-evasion, lateral-movement, and
persistence telemetry. This is a log-correlation golden, not a disk/memory image.
- memlabs-lab1 through memlabs-lab3 are Windows memory CTF labs. They are useful for
Volatility coverage and extraction behavior, but the committed goldens intentionally record
flag counts/objectives and source hashes, not the actual flag values. The upstream downloads
are Mega/browser-oriented, so fetch is env-gated via MEMLABS_LAB{1,2,3}_URL and requires
the matching MEMLABS_LAB{1,2,3}_SHA256 to point at a vetted direct mirror or file:// URL
for the extracted memory dump, not the compressed archive. The upstream archive MD5 remains
in each golden for provenance; the fetch helper verifies the staged memory dump MD5 before L3.
- digitalcorpora-lonewolf is a large Windows disk+memory candidate. Digital Corpora
publishes the E01 segments, memdump.mem, pagefile.sys, FTK log, and commercial forensic
outputs, but the teacher guide is password-protected/faculty-gated. Fetch is opt-in and requires
LONEWOLF_SHA256; until an authorized guide is available, the committed file records required
artifacts and non-scored lead hypotheses instead of reportable expected Findings.
- Disk classes need mount/extract prerequisites. Local raw .dd/.E01 runs can parse supported
artifacts when Sleuth Kit/libewf are present; otherwise they return INDETERMINATE
(custody-only). SIFT remains the recommended parity path. INDETERMINATE is an honest PASS of
the live-test gate when coverage is limited, but it will score below the recall target until
supported artifacts are parsed.
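One plausible reading of the asymmetric verdict-match behavior in the notes above is sketched here. It is an assumption-laden illustration, not the logic in score-recall.py: the function name, the legacy-label mapping table, and the rule that only INDETERMINATE is tolerated as a non-matching verdict are all inferred from the notes rather than taken from the code.

```python
# Legacy golden label mapped to the current scoring label (mapping assumed from this document).
LEGACY_LABELS = {"CONFIRMED_EVIL": "SUSPICIOUS"}


def verdict_consistent(expected: str, actual: str) -> bool:
    """Hypothetical check: exact match passes, an honest INDETERMINATE passes,
    and every other disagreement (including escalation) fails."""
    expected = LEGACY_LABELS.get(expected, expected)
    actual = LEGACY_LABELS.get(actual, actual)
    if actual == expected:
        return True
    # Asymmetry: under-claiming as INDETERMINATE is tolerated, over-claiming is not.
    return actual == "INDETERMINATE"


# alihadi-09-encrypt: escalating past the golden INDETERMINATE fails.
print(verdict_consistent("INDETERMINATE", "SUSPICIOUS"))
# Limited coverage on a SUSPICIOUS golden: INDETERMINATE is still an honest pass.
print(verdict_consistent("SUSPICIOUS", "INDETERMINATE"))
```

Under this reading a verdict check can pass while recall still falls short, which is why the two are reported separately.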
Run results (recall against golden)¶
(Populated as each obtainable dataset is run + scored. No fabricated numbers: gated/unfetchable-on-host datasets are marked "staged, run pending evidence".)
| Case id | Run? | Verdict | Recall | Notes |
|---|---|---|---|---|
| nitroba | yes (local, tshark) | INDETERMINATE | 5/5 (100%), PASS (bar=80%; local, not committed) | Network-playbook gaps fixed (see below). Surfaces all five: anonymous-email contact, source host (192.168.15.4), Gmail-cookie attribution, authenticated Facebook login, and the send-vs-browsing timeline correlation. |
| otrf-apt3-mordor | staged, run pending evidence | — | — | strongest Windows EVTX/Sysmon/JSON candidate; sparse clone only, no raw evidence committed |
| memlabs-lab1 | staged, run pending evidence | — | — | Windows memory CTF; requires extracted memory dump URL or local file URL |
| memlabs-lab2 | staged, run pending evidence | — | — | Windows memory CTF; requires extracted memory dump URL or local file URL |
| memlabs-lab3 | staged, run pending evidence | — | — | Windows memory CTF; requires extracted memory dump URL or local file URL |
| digitalcorpora-lonewolf | staged, run pending evidence | — | — | large Windows disk+memory scenario; teacher guide gated |
| nist-data-leakage | staged, run pending evidence | — | — | needs --sift (disk) |
| nist-hacking-case | yes (committed local summary) | SUSPICIOUS | 7/14 (50%) | coverage gap (not custody). Up from 1/14 after disk-artifact emitters and native fallback triage: now matches nhc-004 (hacking-tool files in Program Files/Desktop, from the MFT), nhc-005 (prefetch execution), nhc-007 (NTUSER shellbag navigation to a \\4.220.254\Temp staging share + tool folders), nhc-008 (LNK removable-media traces), nhc-009 (Recycle Bin staging artifacts), nhc-010 (suspiciously-named SAM account "Mr. Evil"), and nhc-011 (OpenSaveMRU recently-opened installers). Still below the 71% bar: the remaining seven, nhc-001 (ACMru/search history), nhc-002 (USB history), nhc-003 (email carving), nhc-006 (IE index.dat/browser history), nhc-012 (XP .evt, not EVTX), nhc-013 (thumbcache), and nhc-014 (named-pipe enum), are not yet parsed. |
| alihadi-09-encrypt | staged, run pending evidence | — | — | false-positive control; expect INDETERMINATE |
| alihadi-01-webserver | staged, run pending evidence | — | — | disk+memory correlation |
| dfrws-2008-linux | staged, run pending evidence | — | — | Linux memory+disk+network |
| m57-jean | staged, run pending evidence | — | — | practice only |
| alihadi-07-sysinternals | staged, run pending evidence | — | — | practice only |
| dfrws-2011-android | not ready | — | — | source unreliable; golden is a stub |
| volatility-cridex | staged, run pending evidence | — | — | source mirror needed |
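As a worked example of how a recall figure compares against its bar, the sketch below recomputes the two scored rows. The field name min_recall_percent comes from this document; the function names are hypothetical, and treating an empty answer key as 100% is an assumption made only so the sketch is total.

```python
import math


def recall_percent(matched: int, expected: int) -> float:
    """Share of ground-truth claims the run surfaced, as a percentage."""
    if expected == 0:
        return 100.0  # assumption: nothing to miss when the answer key is empty
    return 100.0 * matched / expected


def meets_target(matched: int, expected: int, min_recall_percent: float) -> bool:
    return recall_percent(matched, expected) >= min_recall_percent


def matches_needed(expected: int, min_recall_percent: float) -> int:
    """Smallest number of matched claims that reaches the bar."""
    return math.ceil(min_recall_percent * expected / 100)


print(meets_target(5, 5, 80))    # nitroba: 100% against an 80% bar
print(meets_target(7, 14, 71))   # nist-hacking-case: 50% against a 71% bar
print(matches_needed(14, 71))    # claims needed to clear the 71% bar
```

With 14 expected findings and a 71% bar, 10 matched claims are needed, so the current 7/14 run is three findings short.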
Network-playbook fix (driven by the Nitroba false negative)¶
The first Nitroba run returned NO_EVIL with 0 findings. Root cause was three
compounding bugs, now fixed:
- Packet cap. pcap_triage read only the first 10,000 of 83,153 packets; the harassment traffic sits at packets ~79,800–83,100. Raised the cap (services/mcp/src/tools/pcap_triage.rs) and the engine call (scripts/find_evil_auto.py) to read the whole capture.
- Truncated-pcap intolerance. Reading to the end hit a truncated final packet; tshark exits non-zero on that but still emits every readable packet first. When stdout is non-empty, the tool now triages the packets it got instead of discarding all output.
- No anonymous-email recognition + judge over-collapse. Added extraction of HTTP requests (src→host, method, cookie) plus an anonymous/disposable/self-destruct email-service host category, cookie-based attribution for both webmail and social-media logins, and a send-vs-browsing timeline correlation (per-request timestamps added to pcap_triage) (scripts/find_evil_auto.py). The judge's _group_key collapsed all findings from one tool call into one (it keyed on (tool_call_id, artifact_path) despite a docstring claiming otherwise); it now keys on the claim (services/agent/findevil_agent/judge.py), so a single pcap_triage call can yield multiple distinct findings.
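The over-collapse in the last item comes down to the choice of grouping key. The sketch below reproduces the effect with toy data; the dictionary fields, the two key functions, and the claim normalization are hypothetical stand-ins, not the code in judge.py.

```python
def group_findings(findings, key):
    """Bucket findings by key; each bucket becomes one judged finding."""
    groups = {}
    for finding in findings:
        groups.setdefault(key(finding), []).append(finding)
    return groups


# Two distinct claims produced by the same tool call on the same capture (toy data).
findings = [
    {"tool_call_id": "tc-1", "artifact_path": "nitroba.pcap",
     "claim": "Anonymous email service contacted"},
    {"tool_call_id": "tc-1", "artifact_path": "nitroba.pcap",
     "claim": "Gmail cookie attributes the source host to a user"},
]


def old_key(f):
    # Keyed on the call and artifact: everything from one call collapses.
    return (f["tool_call_id"], f["artifact_path"])


def new_key(f):
    # Keyed on the claim text (normalization here is an illustrative assumption).
    return f["claim"].strip().lower()


print(len(group_findings(findings, old_key)))  # 1
print(len(group_findings(findings, new_key)))  # 2
```

Keying on the claim keeps each distinct assertion separate while still merging exact repeats of the same claim.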
The recall scorer's matcher was also hardened (scripts/score-recall.py): from
symmetric Jaccard to expected-coverage (recall asks whether the run surfaced each
ground-truth claim, so a verbose-but-correct finding should match a concise expected
one), then to maximum bipartite matching with a coverage floor of 0.5 and no
MITRE-technique shortcut. This enforces a 1:1 assignment (one run finding can't
satisfy two claims), requires the claim's distinctive tokens rather than generic
DFIR vocabulary, and finds the optimal assignment rather than a greedy one — so the
recall count can neither be inflated by a single broad finding nor under-counted by
match order. Controls were re-validated (synthetic-benign PASS, alihadi-09
over-confident FAIL, nist-hacking-case partial still below target).
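A minimal sketch of expected-coverage scoring combined with maximum bipartite matching is given below. It is not the matcher in scripts/score-recall.py: claims and findings are reduced to plain token sets, the distinctive-token filtering is omitted, and the function names are hypothetical. The matching uses the standard augmenting-path method (Kuhn's algorithm).

```python
def coverage(expected_tokens: set, finding_tokens: set) -> float:
    """Fraction of the expected claim's tokens that the finding contains."""
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & finding_tokens) / len(expected_tokens)


def matched_claims(expected: list, findings: list, floor: float = 0.5) -> int:
    """Size of a maximum 1:1 matching between expected claims and run findings."""
    # edges[i] lists the findings that cover expected claim i at or above the floor.
    edges = [
        [j for j, f in enumerate(findings) if coverage(e, f) >= floor]
        for e in expected
    ]
    owner = {}  # finding index -> expected-claim index currently assigned to it

    def try_assign(i: int, seen: set) -> bool:
        for j in edges[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free finding, or re-home the claim that currently holds it.
            if j not in owner or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(expected)))


# One broad finding cannot satisfy two claims.
broad = [{"gmail", "cookie", "facebook", "login", "host"}]
print(matched_claims([{"gmail", "cookie"}, {"facebook", "login"}], broad))  # 1

# A greedy first-fit would give the first claim finding 0 and strand the second.
claims = [{"usb", "history"}, {"prefetch", "execution"}]
run = [{"usb", "history", "prefetch", "execution"}, {"usb", "history"}]
print(matched_claims(claims, run))  # 2
```

The second example shows why the assignment has to be optimal rather than greedy: the first claim is moved to the narrower finding so the second claim can use the only finding that covers it.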
Findings corpus (what the agent found)¶
(Populated incrementally as fixtures are run + scored. The public release keeps only small answer keys and evidence summaries; raw case outputs stay local.)
goldens/
├── sans-starter/
│ └── expected-findings.json (ground truth / answer key)
├── nist-hacking-case/
│ └── expected-findings.json
├── otrf-apt3-mordor/
│ └── expected-findings.json
├── memlabs-lab1/
│ └── expected-findings.json
├── memlabs-lab2/
│ └── expected-findings.json
├── memlabs-lab3/
│ └── expected-findings.json
├── digitalcorpora-lonewolf/
│ └── expected-findings.json
├── volatility-cridex/
│ └── expected-findings.json
└── synthetic-benign/
└── expected-findings.json (expected empty findings)
Completed case outputs are generated locally under tmp/auto-runs/<case-id>/ and are intentionally not shipped. Each generated run.manifest.json is verifiable offline by any third party — the entry points are the verify_manifest library function (from findevil_agent.crypto.manifest import verify_manifest) or the manifest_verify MCP tool. The pre-A2 find-evil verify <manifest> CLI was dropped along with findevil_agent/cli.py. See docs/cryptographic-attestation.md "How a third party verifies offline" for the working recipe.
Licensing summary¶
| Fixture | License | Redistribute? |
|---|---|---|
| SANS starter data | SANS starter case data | No (fetch from SANS) |
| NIST CFReDS | Public domain | Yes, by URL reference |
| OTRF Security-Datasets | MIT | Yes, by URL reference (attribution via fetch script) |
| Volatility samples | CC-BY | Yes, by URL reference (attribution via fetch script) |
| Synthetic benign | MIT (our script) | Yes |
| DFIR-Metric | Permissive (verified Week 6) | Yes, by URL reference |
| DFRWS Rodeo | Public domain | Yes |
| Digital Corpora (Nitroba, M57-Jean) | Freely redistributable (research/education) | Yes, by URL reference |
| NIST Data Leakage | Public domain (17 USC 105) | Yes, by URL reference |
| Ali Hadi challenges (#1/#7/#9) | Free for research/education (answers gated) | By URL reference |
| DFRWS 2008/2011 | Public for research/education | By URL reference |
None of these licenses contaminate our Apache-2.0 licensed release repo because we redistribute only URLs and SHA-256 hashes, not the fixtures themselves.