"Couldn't I just ask a model to do the same?"
Same task, two builds, one neutral auditor. A confident direct-model deliverable passes its own tests (10/10) yet is only 66.7% correct on an independent answer key; the Forge build is 100% and discloses its one limitation. The differentiator, made reproducible.
The scorecard
Same request. Same neutral auditor (tools/forge-independent-audit.mjs). Same hand-verified answer key. Reproduce with node demo/run.mjs.
| Metric | Direct model | Forge |
|---|---|---|
| Self-reported tests | 10/10 pass | 12/12 pass |
| Independent oracle accuracy | 8/12 (66.7%) | 12/12 (100.0%) |
| Observed-holiday edge handled | false | true |
| Claims contradicted | 2 | 0 |
| Claims unsupported | 1 | 0 |
| Seams disclosed | 0 | 1 |
| Trust score (0–100) | 15 | 100 |
| Verdict | REJECTED | CERTIFIABLE |
What the audit caught that the model never noticed
The direct-model build's own tests pass, because they were written against its own (wrong) assumption — circular validation. The independent oracle exposes the four wrong answers:
case #3 2026-06-29+5 -> got 2026-07-06 (correct: 2026-07-07) case #4 2026-07-01+2 -> got 2026-07-03 (correct: 2026-07-06) case #5 2026-07-02+1 -> got 2026-07-03 (correct: 2026-07-06) case #7 2026-06-30+3 -> got 2026-07-03 (correct: 2026-07-06)
Why this is the differentiator
The model is rewarded for looking done. Forge is structured to prove it is done — or disclose exactly where it isn't. A direct prompt gives you an answer you must trust; Forge gives you an answer a stranger can verify without trusting you. That is the product.
Honest framing: the point is not that models always produce bugs. It is that a model's self-certification ("tests pass, 100% correct") is asserted, not independently verified — and when it's wrong, nothing in the direct workflow catches it. Here it was wrong; the audit made it visible.
Delivery metrics
Tokens, elapsed time, and cost for producing this outcome. Basis: Estimated — reproducible model over this outcome’s published artifacts. metrics.json
Cost and tokens are estimates derived deterministically from published artifacts and representative list pricing; actual billing may differ. Model basis: claude-sonnet-class (representative).