Outcome-as-a-Service · DEMONSTRATED

Forge-CRS — Autonomous Cyber Reasoning System

A working AIxCC-style cyber reasoning system that autonomously discovers, exploits, patches, and re-verifies vulnerabilities drawn from real-world software vulnerability classes. Coverage-guided fuzzing, crash triage, semantic patching, and patch validation run in one unattended loop.

DELIVERED · Verification 100% (37 / 37) · DEMONSTRATED · v1.0.0
Forge-CRS autonomous campaign results

At a glance

37 / 37 checks passing
5 / 5 remediated
5 CWE classes
0 external deps

The autonomous loop

1 · Discover

Coverage-guided fuzzing driven by real V8 block coverage, with dictionary-based and structure-aware mutation. Inputs that reach new coverage are kept and evolved.
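The keep-and-evolve rule can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toyTarget` is a hypothetical, hand-instrumented function standing in for V8 block coverage, the `DICTIONARY` tokens are invented, and only dictionary mutation is shown (no structure-aware mutation). It is not the Forge-CRS fuzzer.

```javascript
// Deterministic PRNG (mulberry32) so a run can be replayed from its seed.
function makeRng(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical target: each reached "block" records its id in `hit`.
// A real adapter would read block coverage from the engine instead.
function toyTarget(input, hit) {
  hit.add("entry");
  if (input.startsWith("{")) {
    hit.add("brace");
    if (input.includes("__proto__")) {
      hit.add("proto");
      if (input.endsWith("}")) {
        hit.add("close");
        throw new Error("toy bug reached");
      }
    }
  }
}

const DICTIONARY = ["{", "}", "__proto__", ":", '"']; // illustrative tokens

// Dictionary mutation: splice in a known token, or delete one character.
function mutate(input, rng) {
  const pos = Math.floor(rng() * (input.length + 1));
  if (rng() < 0.7 || input.length === 0) {
    const token = DICTIONARY[Math.floor(rng() * DICTIONARY.length)];
    return input.slice(0, pos) + token + input.slice(pos);
  }
  const del = Math.min(pos, input.length - 1);
  return input.slice(0, del) + input.slice(del + 1);
}

function fuzz(target, seeds, maxExecs, rngSeed) {
  const rng = makeRng(rngSeed);
  const seen = new Set(); // every block id observed so far
  const corpus = [];
  let execs = 0;

  const runOne = (input) => {
    const hit = new Set();
    let crash = null;
    try { target(input, hit); } catch (err) { crash = err; }
    execs++;
    let isNew = false;
    for (const id of hit) {
      if (!seen.has(id)) { seen.add(id); isNew = true; }
    }
    return { crash, isNew };
  };

  for (const seed of seeds) {
    if (runOne(seed).crash) return { crashInput: seed, execs };
    corpus.push(seed);
  }
  while (execs < maxExecs) {
    const parent = corpus[Math.floor(rng() * corpus.length)];
    const child = mutate(parent, rng);
    const { crash, isNew } = runOne(child);
    if (crash) return { crashInput: child, execs };
    if (isNew) corpus.push(child); // keep only inputs that reach new blocks
  }
  return { crashInput: null, execs };
}

const result = fuzz(toyTarget, [""], 20000, 1);
console.log(result);
```

Each nested condition in the toy target is reached only after an earlier input that satisfied the previous condition was added to the corpus. Blind random generation would have to satisfy all three conditions at once.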

2 · Exploit

A multi-signal crash oracle confirms a real bug; the crash is minimized to a tight PoV and classified to a CWE independently of ground truth.
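PoV minimization can be sketched as greedy chunk removal, a simplified relative of delta debugging. This is an illustration under stated assumptions, not the Forge-CRS minimizer: `stillCrashes` is a hypothetical predicate standing in for the crash oracle, and the sample input is invented.

```javascript
// Shrink `input` while `stillCrashes(candidate)` keeps returning true.
// Tries to delete chunks of decreasing size at every offset.
function minimize(input, stillCrashes) {
  let current = input;
  let chunk = Math.max(1, Math.floor(current.length / 2));
  while (chunk >= 1) {
    let removedAny = false;
    for (let i = 0; i + chunk <= current.length; ) {
      const candidate = current.slice(0, i) + current.slice(i + chunk);
      if (stillCrashes(candidate)) {
        current = candidate; // keep the smaller input, retry the same offset
        removedAny = true;
      } else {
        i += 1; // this chunk is needed, slide the window forward
      }
    }
    // Only shrink the chunk size once a full pass removes nothing.
    if (!removedAny) chunk = Math.floor(chunk / 2);
  }
  return current;
}

// Hypothetical oracle: the bug triggers whenever the input contains "../".
const crashesOn = (s) => s.includes("../");

const minimal = minimize("aaa../bbb", crashesOn);
console.log(minimal, minimal.length); // prints: ../ 3
```

The result is 1-minimal with respect to the oracle: deleting any single remaining character stops the crash. It is not guaranteed to be the globally smallest crashing input.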

3 · Patch

A semantic source rewrite for the bug class is synthesized and applied to a throwaway copy; the patch is never trusted until verification proves it.
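A minimal sketch of a class-specific rewrite applied to a copy, assuming a hypothetical recursive `merge` function as the vulnerable target. The rewrite rule here is plain text substitution standing in for a real semantic (AST-level) rewrite, and `new Function` stands in for loading the patched copy from a scratch directory. None of these names come from Forge-CRS.

```javascript
// Hypothetical vulnerable target, held as source text so it can be rewritten.
const originalSource = `
function merge(target, source) {
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value && typeof value === "object") {
      if (!target[key]) target[key] = {};
      merge(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}`;

// Rewrite rule for the prototype-pollution class: skip dangerous keys.
// Returns a NEW string; the original source is never modified.
function applyKeyGuard(source) {
  const loopHeader = "for (const key of Object.keys(source)) {";
  const guard =
    '\n    if (key === "__proto__" || key === "constructor" || key === "prototype") continue;';
  if (!source.includes(loopHeader)) return null; // rule does not apply
  return source.replace(loopHeader, loopHeader + guard);
}

// Load a source string as an isolated function (demo stand-in for a temp copy).
function load(source) {
  return new Function(source + "\nreturn merge;")();
}

const patchedSource = applyKeyGuard(originalSource);
const patchedMerge = load(patchedSource);

// Replay an illustrative PoV against the patched copy only.
const pov = JSON.parse('{"__proto__": {"polluted": true}}');
patchedMerge({}, pov);
const povNeutralized = {}.polluted === undefined;

console.log({ povNeutralized, changed: patchedSource !== originalSource });
```

Keeping the patch as a separate string means a rejected candidate can simply be discarded, with no rollback on the original target.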

4 · Verify

The patch is accepted only if the PoV is neutralized AND every functional regression case still passes.
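The two-part acceptance rule can be sketched as a gate over candidate patches. The storage root, the three candidate implementations, the PoV, and the regression cases are all hypothetical, chosen to show why both conditions are needed. This is not the Forge-CRS validator.

```javascript
const path = require("path").posix; // POSIX rules keep the demo platform-independent

const ROOT = "/srv/store"; // hypothetical storage root

// Oracle: does a resolved path stay inside ROOT?
const inside = (p) => p === ROOT || p.startsWith(ROOT + "/");

// Three candidate implementations of a hypothetical resolve step.
const unpatched = (name) => path.join(ROOT, name);
const overBlocking = (name) => {
  if (name.includes(".")) throw new Error("rejected"); // also rejects "notes.txt"
  return path.join(ROOT, name);
};
const contained = (name) => {
  const full = path.resolve(ROOT, name);
  if (!inside(full)) throw new Error("escapes root");
  return full;
};

// Accept only if the PoV is neutralized AND every regression case passes.
function verifyPatch(candidate, pov, regressionCases) {
  let povNeutralized;
  try {
    povNeutralized = inside(candidate(pov));
  } catch {
    povNeutralized = true; // rejecting the PoV counts as neutralized
  }
  let passed = 0;
  for (const { input, expected } of regressionCases) {
    try {
      if (candidate(input) === expected) passed++;
    } catch {
      // a throw on a legitimate input is a regression failure
    }
  }
  const total = regressionCases.length;
  return { povNeutralized, passed, total, accepted: povNeutralized && passed === total };
}

const pov = "../secret";
const regression = [
  { input: "notes.txt", expected: "/srv/store/notes.txt" },
  { input: "a/b.txt", expected: "/srv/store/a/b.txt" },
];

for (const [name, impl] of Object.entries({ unpatched, overBlocking, contained })) {
  console.log(name, verifyPatch(impl, pov, regression));
}
```

In this sketch `overBlocking` stops the PoV but breaks legitimate inputs, so it is rejected. Only `contained` satisfies both conditions.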

Benchmark targets

config-merge CWE-1321

Prototype Pollution: discovered in 3 executions, PoV minimized to 31 bytes, patch verified (2/2 regression cases).

path-store CWE-22

Path Traversal: discovered in 189 executions, PoV minimized to 2 bytes, patch verified (2/2 regression cases).

task-runner CWE-78

OS Command Injection: discovered in 14 executions, PoV minimized to 1 byte, patch verified (1/1 regression cases).

regex-validate CWE-1333

ReDoS (catastrophic backtracking): discovered in 40 executions, PoV minimized to 23 bytes, patch verified (4/4 regression cases).

binary-reader CWE-125

Out-of-bounds Read: discovered in 5 executions, PoV minimized to 2 bytes, patch verified (2/2 regression cases).
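For readers unfamiliar with how an out-of-bounds read surfaces in JavaScript, here is a hedged illustration. The length-prefixed record format and both reader functions are hypothetical and are not the binary-reader target's code. In JavaScript a typed-array read past the end yields `undefined` rather than touching foreign memory, so the bug shows up as silently invalid data.

```javascript
// Hypothetical record layout: byte 0 declares the payload length.
function readRecordUnchecked(buf) {
  const declared = buf[0];
  const out = [];
  for (let i = 0; i < declared; i++) {
    out.push(buf[1 + i]); // trusts the declared length, may index past the end
  }
  return out;
}

// Bounds-checked variant: validate the declared length before reading.
function readRecordChecked(buf) {
  if (buf.length < 1) throw new RangeError("missing length byte");
  const declared = buf[0];
  if (1 + declared > buf.length) {
    throw new RangeError("declared length exceeds available bytes");
  }
  return Array.from(buf.subarray(1, 1 + declared));
}

// Illustrative malformed input: claims 5 payload bytes but carries only 1.
const malformed = Uint8Array.of(5, 0x41);
const leaked = readRecordUnchecked(malformed);
console.log(leaked.includes(undefined)); // prints: true

// A well-formed record still parses after the check is added.
console.log(readRecordChecked(Uint8Array.of(2, 1, 2))); // prints: [ 1, 2 ]
```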

Deliverables

Honest scope

DARPA's AIxCC exists because autonomous find-and-fix on real-world OSS at scale is unsolved. Forge-CRS is a faithful, fully working microcosm of that loop, not a finalist-grade system. The executed pipeline runs on the JavaScript/Node language adapter; the C/C++/Java adapters (native sanitizers, libFuzzer/AFL++, Jazzer), whole-repo scale, and free-form program repair are disclosed as architectural seams and are not live. Full disclosure is in the certification report.

Delivery metrics

Tokens, elapsed time, and cost for producing this outcome. Basis: estimated, from a reproducible model over this outcome’s published artifacts (metrics.json).

~130.9k
Tokens used
~43m 38s
Elapsed time
~$0.589
Cost (USD)

Cost and tokens are estimates derived deterministically from published artifacts and representative list pricing; actual billing may differ. Model basis: claude-sonnet-class (representative).
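As a sanity check on the figures above, the implied throughput and blended price can be recomputed from the three published numbers. This is arithmetic on the rounded values shown here, not the estimation model itself, whose inputs are not given on this page.

```javascript
// Published, approximate figures from the delivery metrics above.
const tokens = 130.9e3;             // ~130.9k tokens
const elapsedSeconds = 43 * 60 + 38; // ~43m 38s
const costUsd = 0.589;              // ~$0.589

// Implied rates (an inference from the rounded figures, not published values).
const tokensPerSecond = tokens / elapsedSeconds;
const usdPerMillionTokens = (costUsd / tokens) * 1e6;

console.log(tokensPerSecond.toFixed(1), usdPerMillionTokens.toFixed(2)); // prints: 50.0 4.50
```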