Executive Evidence — forge-pm-work-order-sim
FICTIONAL / SYNTHETIC DEPLOYMENT MODEL. "Forge Property Management" is an
invented customer. No real customer data was used; no real production result is
claimed. The Proof Layer (IRS_AUDITOR) assigns the authoritative certification
state independently and may not be overridden here.
This answers the ten IRS_AUDITOR questions from objective evidence.
1. What exactly is being claimed?
That the existing Work Order Agent Ecosystem, run unmodified over a deterministic synthetic corpus of 100,000 property-management work orders, achieves the operational metrics recorded in verification-report.json (auto-action rate, classification/priority/routing accuracy, exception recall/precision, false-auto-action rate, duplicate suppression, SLA routing, audit completeness, idempotency), and that a transparent ROI model over stated assumptions yields the financial figures in evidence/roi.json. Nothing about a *real* deployment is claimed.
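As a rough illustration of how the exception metrics named above could be derived from a labelled corpus, the sketch below counts agent outcomes against a synthetic answer key. The field names shouldEscalate and wasEscalated, the function name exceptionMetrics, and the choice of denominator for the false-auto-action rate (all auto-actioned orders) are assumptions made for this example; the authoritative definitions are whatever verify.mjs implements.

```javascript
// Hypothetical record shape: { shouldEscalate: boolean, wasEscalated: boolean }.
// shouldEscalate comes from the synthetic answer key; wasEscalated is what the agents did.
function exceptionMetrics(records) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  let autoActioned = 0;

  for (const { shouldEscalate, wasEscalated } of records) {
    if (wasEscalated) {
      if (shouldEscalate) truePositives += 1;
      else falsePositives += 1;
    } else {
      autoActioned += 1;
      if (shouldEscalate) falseNegatives += 1;
    }
  }

  return {
    // Share of true exceptions that were escalated.
    recall: truePositives / (truePositives + falseNegatives),
    // Share of escalations that were true exceptions.
    precision: truePositives / (truePositives + falsePositives),
    // Assumed definition: missed exceptions as a share of auto-actioned orders.
    falseAutoActionRate: falseNegatives / autoActioned,
  };
}

// Tiny made-up sample: 3 correct escalations, 1 needless escalation,
// 1 missed exception, 3 correct auto-actions.
const sample = [
  ...Array(3).fill({ shouldEscalate: true, wasEscalated: true }),
  { shouldEscalate: false, wasEscalated: true },
  { shouldEscalate: true, wasEscalated: false },
  ...Array(3).fill({ shouldEscalate: false, wasEscalated: false }),
];
const sampleMetrics = exceptionMetrics(sample);
console.log(sampleMetrics);
```

The sketch does not guard against empty denominators; a ratio with no qualifying records evaluates to NaN, which a real report would need to handle explicitly.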
2. What evidence supports each claim?
The evidence is verification-report.json together with evidence/verification-results.json, evidence/simulation-results.json, evidence/roi.json, and evidence/audit-trace-sample.json. These files are produced by node verify.mjs, and the run is captured in proof/EXECUTION_TRACE.json and proof/evidence/verify.log. Every shipped file is checksummed in proof/CHECKSUMS.json.
3. Can an independent engineer reproduce this claim?
Yes. The corpus is seeded (20260625); proof/REPRODUCE.md gives exact commands. A corpus-fingerprint check asserts the inputs are byte-identical across runs.
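The idea of a seeded corpus with a fingerprint check can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator in verify.mjs: the mulberry32-style PRNG, the order fields, the category list, and the function names are all hypothetical, and the FNV-1a hash is a dependency-free stand-in for whatever digest the real fingerprint uses (a cryptographic hash such as SHA-256 would be the more typical choice). Only the seed value 20260625 comes from this document.

```javascript
// Small deterministic PRNG (mulberry32-style). An assumption for this sketch;
// the generator actually used by verify.mjs is not described here.
function makeRng(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical categories; the real corpus schema is not reproduced here.
const CATEGORIES = ['plumbing', 'electrical', 'hvac', 'appliance'];

function generateCorpus(seed, count) {
  const rng = makeRng(seed);
  const orders = [];
  for (let i = 0; i < count; i += 1) {
    orders.push({
      id: i + 1,
      category: CATEGORIES[Math.floor(rng() * CATEGORIES.length)],
      priority: 1 + Math.floor(rng() * 3),
    });
  }
  return orders;
}

// FNV-1a over the serialized corpus (stand-in for a cryptographic digest).
// It hashes UTF-16 code units, which is adequate for this ASCII-only sketch.
function fingerprint(corpus) {
  const text = JSON.stringify(corpus);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

const firstRun = fingerprint(generateCorpus(20260625, 1000));
const secondRun = fingerprint(generateCorpus(20260625, 1000));
console.log(firstRun === secondRun); // true: same seed, same corpus, same fingerprint
```

Because every random draw flows from the seed, two runs produce the same serialized corpus, and comparing fingerprints is a cheap way to detect an accidental change to the inputs.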
4. What assumptions were made?
The customer, portfolio, and all data are invented. Manual handling time (8 min), exception review time (5 min), coordinator cost ($38/h), implementation cost, and platform cost are illustrative assumptions stated in assumptions.json, not measured figures.
5. What limitations exist?
Synthetic text is cleaner than real intake; in-process infrastructure (PGlite + loopback gRPC) stands in for external Postgres + a Go dispatch service; the classifier is a lexicon engine, not an LLM. See proof/LIMITATIONS.md.
6. What seams exist?
The fictional customer, the synthetic data, the simulated operational/financial assumptions, the in-process infrastructure, and the lexicon classifier are all disclosed seams (listed in verification-report.json#/disclosedSeams and proof/LIMITATIONS.md).
7. What was actually executed?
node verify.mjs generated 100,000 orders, booted a real PostgreSQL engine in-process (PGlite) and a real gRPC dispatch service over loopback, and ran all four agents end-to-end over every order, performing real SQL writes and real gRPC round trips. The exit code and stdout hash are recorded in proof/EXECUTION_TRACE.json.
8. What was inferred?
The ROI is *inferred* from the measured auto-action rate plus stated assumptions; it is a model, not a realized financial result. It is also inferred that real-world accuracy will differ from accuracy measured against the synthetic answer key.
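The shape of that inference can be sketched as simple arithmetic. The example below uses the handling-time and cost assumptions stated under question 4 (8 min manual handling, 5 min exception review, $38/h). Everything else is a hypothetical simplification: the batch size and auto-actioned count are made-up inputs rather than measured values, the function name is invented, and the model assumes each time figure applies per order, that auto-actioned orders need no coordinator time, and that each remaining order needs one exception review. Implementation and platform costs are left out because their values are not given here. The authoritative model is the one that produces evidence/roi.json.

```javascript
// Stated illustrative assumptions (see assumptions.json).
const MANUAL_MINUTES = 8;            // manual handling time
const EXCEPTION_REVIEW_MINUTES = 5;  // review time for an exception
const COORDINATOR_COST_PER_HOUR = 38;

// Hypothetical helper: gross labour savings for one batch of orders.
// Assumes auto-actioned orders need no coordinator time and every other
// order needs exactly one exception review.
function grossLabourSavings(totalOrders, autoActionedOrders) {
  const exceptionOrders = totalOrders - autoActionedOrders;
  const baselineMinutes = totalOrders * MANUAL_MINUTES;
  const remainingMinutes = exceptionOrders * EXCEPTION_REVIEW_MINUTES;
  const savedHours = (baselineMinutes - remainingMinutes) / 60;
  return {
    savedHours,
    savedDollars: savedHours * COORDINATOR_COST_PER_HOUR,
  };
}

// Made-up batch: 1,200 orders, 900 of them auto-actioned.
// Baseline 9,600 min; remaining 300 * 5 = 1,500 min; saved 8,100 min = 135 h.
const roiExample = grossLabourSavings(1200, 900);
console.log(roiExample); // { savedHours: 135, savedDollars: 5130 }
```

A net figure would subtract the implementation and platform costs from the gross savings, which is why the result depends as much on the stated assumptions as on the measured auto-action rate.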
9. What remains unverified?
Anything about a real deployment: real-text accuracy, external Postgres / Go gRPC behaviour, LLM-classifier accuracy, enterprise non-functional controls, and any realized savings. None are claimed.
10. What evidence would invalidate this claim?
A failing MUST_PASS check, a checksum mismatch (node tools/forge-proof-verify.mjs --outcome delivery-package/forge-pm-work-order-sim), a corpus-fingerprint mismatch, or any metric without a source in proof/CLAIM_EVIDENCE.json.
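The four invalidating conditions can be read as a single predicate over the proof artifacts. The sketch below is one hedged way to express that; the input shape (checks, checksumMismatches, fingerprintMatches, metrics, claimEvidence), the check id, and the function name are hypothetical and do not reflect the real schemas of verification-report.json, proof/CHECKSUMS.json, or proof/CLAIM_EVIDENCE.json.

```javascript
// Hypothetical input shape; the real report and manifest schemas may differ.
function invalidationReasons(input) {
  const { checks, checksumMismatches, fingerprintMatches, metrics, claimEvidence } = input;
  const reasons = [];

  // Condition 1: any failing MUST_PASS check.
  for (const check of checks) {
    if (check.level === 'MUST_PASS' && !check.passed) {
      reasons.push(`failing MUST_PASS check: ${check.id}`);
    }
  }

  // Condition 2: any file whose checksum no longer matches the manifest.
  for (const file of checksumMismatches) {
    reasons.push(`checksum mismatch: ${file}`);
  }

  // Condition 3: the corpus fingerprint differs from the recorded one.
  if (!fingerprintMatches) {
    reasons.push('corpus-fingerprint mismatch');
  }

  // Condition 4: a reported metric with no evidence source.
  for (const name of metrics) {
    if (!Object.prototype.hasOwnProperty.call(claimEvidence, name)) {
      reasons.push(`metric without source: ${name}`);
    }
  }

  return reasons; // an empty array means none of the four conditions fired
}

const reasons = invalidationReasons({
  checks: [{ id: 'audit-completeness', level: 'MUST_PASS', passed: true }],
  checksumMismatches: [],
  fingerprintMatches: true,
  metrics: ['autoActionRate', 'exceptionRecall'],
  claimEvidence: { autoActionRate: 'verification-report.json' },
});
console.log(reasons); // [ 'metric without source: exceptionRecall' ]
```

Returning the list of reasons rather than a boolean keeps the result auditable: a reviewer sees which condition fired, not only that the claim failed.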