Methodology · Independent Audit Layer

"Couldn't I just ask a model to do the same?"

Same task, two builds, one neutral auditor. A confident direct-model deliverable passes its own tests (10/10) yet is only 66.7% correct on an independent answer key; the Forge build is 100% and discloses its one limitation. The differentiator, made reproducible.

The scorecard

Same request. Same neutral auditor (tools/forge-independent-audit.mjs). Same hand-verified answer key. Reproduce with node demo/run.mjs.

MetricDirect modelForge
Self-reported tests10/10 pass12/12 pass
Independent oracle accuracy8/12 (66.7%)12/12 (100.0%)
Observed-holiday edge handledfalsetrue
Claims contradicted20
Claims unsupported10
Seams disclosed01
Trust score (0–100)15100
VerdictREJECTEDCERTIFIABLE

What the audit caught that the model never noticed

The direct-model build's own tests pass, because they were written against its own (wrong) assumption — circular validation. The independent oracle exposes the four wrong answers:

case #3  2026-06-29+5  ->  got 2026-07-06  (correct: 2026-07-07)
case #4  2026-07-01+2  ->  got 2026-07-03  (correct: 2026-07-06)
case #5  2026-07-02+1  ->  got 2026-07-03  (correct: 2026-07-06)
case #7  2026-06-30+3  ->  got 2026-07-03  (correct: 2026-07-06)

Why this is the differentiator

The model is rewarded for looking done. Forge is structured to prove it is done — or disclose exactly where it isn't. A direct prompt gives you an answer you must trust; Forge gives you an answer a stranger can verify without trusting you. That is the product.

Honest framing: the point is not that models always produce bugs. It is that a model's self-certification ("tests pass, 100% correct") is asserted, not independently verified — and when it's wrong, nothing in the direct workflow catches it. Here it was wrong; the audit made it visible.

Delivery metrics

Tokens, elapsed time, and cost for producing this outcome. Basis: Estimated — reproducible model over this outcome’s published artifacts. metrics.json

~86.1k
Tokens used
~28m 42s
Elapsed time
~$0.388
Cost (USD)

Cost and tokens are estimates derived deterministically from published artifacts and representative list pricing; actual billing may differ. Model basis: claude-sonnet-class (representative).