# Forge vs. a direct model — the differentiator, made auditable

> "Couldn't I just ask a model to do the same thing?"

This is the answer to that question, as a **reproducible experiment** instead of
a marketing claim. Same task, two deliverables, one neutral auditor.

## The result

```
                                  DIRECT MODEL              FORGE
--------------------------------------------------------------------------------------
Self-reported tests               10/10 pass                12/12 pass
Independent oracle accuracy       8/12 (66.7%)              12/12 (100.0%)
Observed-holiday edge handled     false                     true
Claims SUPPORTED                  1                         3
Claims CONTRADICTED               2                         0
Claims UNSUPPORTED                1                         0
Seams disclosed                   0                         1
Trust score (0-100)               15                        100
VERDICT                           REJECTED                  CERTIFIABLE
```

Reproduce it yourself: `node run.mjs`.

## What this actually demonstrates

Both sides used the same model-grade ability to write code. The direct-model
deliverable is **clean, confident, and its own tests pass 10/10**. By every
signal a normal "ask a model" workflow produces, it looks done.

**It is wrong.** It silently mishandles the one trap in the task — the
observed-holiday weekend shift (July 4, 2026 is a Saturday, observed on Friday
July 3). Its unit tests don't catch this because the model wrote them against
its *own* assumption: the tests confirm the bug instead of the specification.
That is **circular validation**, and it is exactly how confident-but-wrong AI
output slips through.

The independent auditor — identical code for both sides — doesn't trust any
self-report. It executes both implementations against a **neutral answer key
verified by hand** (`oracle/labeled-cases.json`) and checks every written claim
against reproducible evidence. It immediately catches the four wrong answers the
model never noticed, and exposes the gap between what was *claimed* (`100% correct, production-ready`) and what is *true* (66.7%).

**That is the differentiator.** It is not "Forge writes better code." It is:

- The model is rewarded for **looking** done; Forge is structured to **prove** it
  is done — or to disclose exactly where it isn't (note Forge's single
  **disclosed seam**: 2026-only scope).
- Self-written tests validate your assumptions; Forge validates against an
  **independent oracle** and a **non-bypassable audit**.
- A direct prompt gives you an answer you must **trust**. Forge gives you an
  answer a stranger can **verify without trusting you** — which is the whole
  product.

## Honest framing (so this isn't a strawman)

- Side A is a *representative* confident single-shot deliverable. The point is
  **not** that models always produce bugs — it's that a model's self-certification
  ("tests pass, 100% correct") is **asserted, not independently verified**, and
  when it's wrong, nothing in the direct workflow catches it. Here it was wrong;
  the audit is what made that visible.
- The bug shown (ignoring observed weekend shifts) is a documented, common
  error class for date logic, chosen because it is realistic and has a single
  objectively correct answer.
- The answer key is hand-verified and independent of both implementations, so
  neither side can "grade its own homework."
- Everything here is deterministic and offline. `node run.mjs` reproduces the
  scorecard byte-for-byte.

## Files

```
task/REQUEST.md              the identical request given to both sides
oracle/labeled-cases.json    neutral, hand-verified answer key (12 cases)
side-a-direct-model/         representative single-shot model deliverable (buggy, confident)
side-b-forge/                Forge build (correct, modest claims, disclosed seam)
audit/independent-auditor.mjs the SAME auditor applied to both sides
audit/AUDIT_REPORT.json      machine-readable comparative result (generated)
run.mjs                      runs the audit over both sides and prints the scorecard
```
