Forge-CRS — Autonomous Cyber Reasoning System

Certification Report — Forge-CRS

Certification level: DEMONSTRATED (research-grade, honestly scoped). Forge-CRS is not certified as an AIxCC-finalist-grade Cyber Reasoning System, and it is not certified for production security use against arbitrary software.

The certification level is set from real evidence (the verification run), not aspiration. See verification-report.md for the machine-checked results.

The original problem, stated plainly

DARPA's AI Cyber Challenge exists because *autonomously finding and fixing vulnerabilities in real-world open-source software, at scale, with minimal human intervention* is an unsolved problem. A single session cannot deliver a system that solves it; any claim otherwise would be dishonest. Clearing that bar requires whole-repository analysis of C/C++/Java, native sanitizer and fuzzer harnessing, semantic program repair of novel bugs, and a scoring/SARIF harness: a multi-team, multi-year effort.

What IS certified (live, executed, re-verifiable)

Forge-CRS is a complete, working implementation of the AIxCC loop — discover → exploit → patch → verify — running unattended and deterministically. The verification run (node verify.mjs) proves, with 37/37 checks passing:

| Claim | Evidence |
| --- | --- |
| Autonomous discovery via coverage-guided fuzzing | 5/5 targets crashed by fuzzing the *unpatched* code; V8 block coverage confirmed live on every in-process target. |
| Correct vulnerability classification | 5/5 crash signals classified to the correct CWE, checked against ground truth. |
| PoV minimization | Random crashing inputs reduced to tight reproducers (e.g. `..`, `0xff7f`, `{"__proto__":{"polluted":true}}`). |
| Automated patching | 5/5 semantic source rewrites synthesized and applied. |
| Patch correctness | 5/5 PoVs neutralized by the patch and 11/11 functional regression cases preserved. |
| Determinism | Two identically-seeded campaigns are bit-for-bit identical. |
| Safety | No real exploit executed; FS/process sinks are injected recorders; hangs are sandboxed in killable workers. |
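
To illustrate the PoV minimization claim, here is a minimal sketch of a greedy chunk-removal reducer (a simplified relative of delta debugging). It is not taken from the Forge-CRS source: the function name `minimize` and the `crashes` predicate are hypothetical, and the real minimizer may work differently. The sketch also does not guarantee a globally minimal reproducer.

```javascript
// Hypothetical sketch, not the Forge-CRS implementation.
// `crashes` is an assumed predicate: (input: string) => boolean.
function minimize(input, crashes) {
  if (!crashes(input)) {
    throw new Error("input does not reproduce the crash");
  }
  let current = input;
  // Try removing large chunks first, then halve the chunk size.
  for (let chunk = Math.max(1, current.length >> 1); chunk >= 1; chunk >>= 1) {
    let i = 0;
    while (i + chunk <= current.length) {
      const candidate = current.slice(0, i) + current.slice(i + chunk);
      if (crashes(candidate)) {
        current = candidate; // still crashes: keep the smaller input
      } else {
        i += chunk; // this chunk is needed: move past it
      }
    }
  }
  return current;
}

// Toy oracle standing in for a path-traversal crash signal.
const reproducer = minimize("abc/../xyz", (s) => s.includes(".."));
console.log(reproducer); // → ".."
```

Because the reducer calls only the supplied predicate and uses no randomness, repeated runs on the same input return the same reproducer.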

Vulnerability classes covered are real and common in OSS: CWE-1321 prototype pollution, CWE-22 path traversal, CWE-78 command injection, CWE-1333 ReDoS, CWE-125 out-of-bounds read.
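
As a concrete picture of one of these classes, the sketch below shows a CWE-1321 prototype-pollution shape in a recursive merge, next to a guarded variant in the spirit of a key-guard fix. Both functions (`mergeUnsafe`, `mergeGuarded`) and the forbidden-key list are assumptions made for illustration; they are not the benchmark targets or the patches Forge-CRS synthesizes.

```javascript
// Hypothetical illustration of CWE-1321; not a Forge-CRS benchmark target.
function mergeUnsafe(target, source) {
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value !== null && typeof value === "object") {
      if (typeof target[key] !== "object" || target[key] === null) {
        target[key] = {};
      }
      // With key "__proto__", target[key] is Object.prototype.
      mergeUnsafe(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// Assumed guard list; a real patch may choose different keys.
const FORBIDDEN_KEYS = new Set(["__proto__", "constructor", "prototype"]);

function mergeGuarded(target, source) {
  for (const key of Object.keys(source)) {
    if (FORBIDDEN_KEYS.has(key)) continue; // skip keys that reach the prototype chain
    const value = source[key];
    if (value !== null && typeof value === "object") {
      if (typeof target[key] !== "object" || target[key] === null) {
        target[key] = {};
      }
      mergeGuarded(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// JSON.parse creates "__proto__" as an own property, so the merge sees it.
const pov = JSON.parse('{"__proto__":{"polluted":true},"a":{"b":1}}');
const merged = mergeGuarded({}, pov);
console.log(merged.a.b, ({}).polluted); // → 1 undefined
```

Only the guarded merge is executed here; running `mergeUnsafe` on the same input would set `polluted` on `Object.prototype` for every object in the process.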

What ships as a SEAM (explicitly NOT live)

These are designed-for but not executed in this package. They are disclosed here rather than implied to work:

1. C/C++/Java adapters: the languages AIxCC actually scores. The registry/adapter interface is language-agnostic, but native sanitizer (ASan/UBSan), libFuzzer/AFL++, and Jazzer integrations are not wired up or run here. The live, executed adapter is JavaScript/Node only.
2. Whole-repository / OSS scale: the benchmark is single-file targets, not million-line repos with build systems and persistent corpora.
3. Free-form program repair: patch synthesis uses bug-class strategies (key-guard, containment check, argv-not-shell, linear regex, bounds clamp), not learned/LLM repair of previously-unseen bug shapes.
4. Real PoV detonation & exploit chaining: exploitation is observed via instrumented sinks; the CRS does not execute real shellcode or chain primitives.
5. Scoring/SARIF harness & CWE coverage breadth: only the five classes above are implemented.
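
The idea of observing exploitation through an instrumented sink, rather than detonating anything, can be sketched as follows. This assumes a target that receives its process-spawning function as a parameter; `makeRecorderSink`, `pingHost`, and `detectInjection` are hypothetical names, and the metacharacter check is a deliberately crude oracle, not the classifier Forge-CRS uses.

```javascript
// Hypothetical sketch of an injected recorder sink; nothing is executed.
function makeRecorderSink() {
  const calls = [];
  // Stands in for a shell-executing function: records the command, runs nothing.
  const sink = (command) => {
    calls.push(command);
    return "";
  };
  return { sink, calls };
}

// Toy target with a CWE-78 shape: user input is concatenated into a shell string.
function pingHost(host, exec) {
  return exec("ping -c 1 " + host);
}

// Crude oracle (assumption): shell metacharacters reached the sink.
function detectInjection(calls) {
  return calls.some((command) => /[;&|$]/.test(command));
}

const benign = makeRecorderSink();
pingHost("example.com", benign.sink);
console.log(detectInjection(benign.calls)); // → false

const hostile = makeRecorderSink();
pingHost("example.com; id", hostile.sink);
console.log(detectInjection(hostile.calls)); // → true
```

The recorder replaces the real sink entirely, so a hostile input is visible as data in `calls` without any command being spawned.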

Honest bottom line

Forge-CRS is a faithful, fully-working microcosm of an AIxCC-style Cyber Reasoning System. It genuinely closes the discover→exploit→patch→verify loop without human intervention and proves each step — but on a controlled benchmark in one language adapter, not on arbitrary real-world OSS at competition scale. The gap between this and a finalist CRS is exactly the set of seams listed above, and that gap is the unsolved part of the problem.