Executive Evidence — Work-Order Agent Ecosystem
PROOF_STANDARD = IRS_AUDITOR. This document answers the ten mandatory questions with objective evidence. Every number traces through proof/CLAIM_EVIDENCE.json → proof/evidence/verification-report.json, produced by node verify.mjs against external live services and recorded in proof/EXECUTION_TRACE.json.
1. What exactly is being claimed?
A four-agent pipeline (classify → route → validate → action) processes work orders end-to-end over external live infrastructure — a separate PostgreSQL server (over TCP) and a separate Go gRPC dispatch service (over the wire) — and, on a synthetic labeled corpus of 600 orders:
- Classification accuracy 98.87%; priority accuracy 96.42%;
region-routing accuracy 100%.
- Exception-detection recall 1.0, precision 0.974.
- False-auto-action rate 0.0% (0 / 600).
- Automatic-action rate 62.17%, leaving ~37.8% of orders for human review.
- Every order persisted to the external Postgres, with one append-only audit row each.
- Dispatch idempotent across the gRPC wire; malformed requests rejected by the Go server.
- The client reconnects after the Go service restarts; data survives an external Postgres restart.
- The HTTP API ingests, persists, and serves orders.
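The headline rates above are simple ratios, and the arithmetic can be sketched as follows. The helper names are hypothetical, and the counts marked as assumed are illustrative values that happen to round to the reported figures; the authoritative counts are whatever proof/evidence/verification-report.json records.

```javascript
// Hypothetical helpers; not taken from verify.mjs.
function rate(numerator, denominator) {
  return numerator / denominator;
}

function precisionRecall({ truePositives, falsePositives, falseNegatives }) {
  return {
    precision: truePositives / (truePositives + falsePositives),
    recall: truePositives / (truePositives + falseNegatives),
  };
}

const totalOrders = 600;
const autoActioned = 373; // assumed: the whole-order count implied by 62.17% of 600

const autoRate = rate(autoActioned, totalOrders);
const humanRate = 1 - autoRate;
console.log((autoRate * 100).toFixed(2)); // "62.17"
console.log((humanRate * 100).toFixed(2)); // "37.83"

// Assumed confusion counts: one combination that rounds to precision 0.974
// with recall 1.0. The real counts are not stated in this document.
const exceptionMetrics = precisionRecall({
  truePositives: 221,
  falsePositives: 6,
  falseNegatives: 0,
});
console.log(exceptionMetrics.precision.toFixed(3)); // "0.974"
console.log(exceptionMetrics.recall); // 1
```

A recall of 1.0 means no true exception was missed. A precision below 1.0 means some orders were sent to a human that did not strictly need it, which is the conservative direction for an auto-action system.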
2. What evidence supports each claim?
verify.mjs runs 23 checks against the external stack and writes verification-report.json (copied to proof/evidence/). Each check maps to a claim in CLAIM_EVIDENCE.json. Raw run output is in proof/evidence/verify.log.
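The claim-to-check mapping can be audited mechanically. The sketch below assumes a hypothetical shape for both files (a `checks` array with `id` and `status` in the report, and a `checks` list of ids on each claim); the real schemas of CLAIM_EVIDENCE.json and verification-report.json are not shown here and may differ.

```javascript
// Returns the ids of claims that have no backing check, or whose backing
// checks did not all pass. Field names are assumptions for illustration.
function findUnsupportedClaims(claims, report) {
  const passing = new Set(
    report.checks
      .filter((check) => check.status === "pass")
      .map((check) => check.id)
  );
  return claims
    .filter(
      (claim) =>
        claim.checks.length === 0 ||
        !claim.checks.every((id) => passing.has(id))
    )
    .map((claim) => claim.id);
}

// Illustrative data only; these ids are invented for the example.
const exampleReport = {
  checks: [
    { id: "db-row-count", status: "pass" },
    { id: "grpc-idempotent-replay", status: "pass" },
  ],
};
const exampleClaims = [
  { id: "every-order-persisted", checks: ["db-row-count"] },
  { id: "dispatch-idempotent", checks: ["grpc-idempotent-replay"] },
  { id: "claim-without-source", checks: [] },
];

console.log(findUnsupportedClaims(exampleClaims, exampleReport));
// [ 'claim-without-source' ]
```

An empty result would mean every claim traces to at least one passing check, which is the property this section asserts.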
3. Can an independent engineer reproduce this claim?
Yes. proof/REPRODUCE.md gives exact commands (docker compose up -d --build, then DATABASE_URL=… DISPATCH_GRPC_URL=… node verify.mjs). Agent logic is deterministic (seed 42); live DB row counts, resilience, and durability are asserted at run time; proof/CHECKSUMS.json pins every input. Note: this is reproducibility of the build, not an independent third-party reproduction of results on real data (see Q9).
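To illustrate what "deterministic (seed 42)" implies, here is a minimal sketch of a seeded generator and a re-run comparison. It uses mulberry32, a small well-known PRNG, purely as a stand-in: this document does not say which generator or corpus builder the project uses, and `generateCorpus` with its fields is hypothetical.

```javascript
// mulberry32: a compact 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical corpus builder: every random choice flows from the seed,
// so the same seed always yields the same orders.
function generateCorpus(seed, count) {
  const random = mulberry32(seed);
  const trades = ["plumbing", "electrical", "hvac", "carpentry"];
  const orders = [];
  for (let i = 0; i < count; i += 1) {
    orders.push({
      id: `WO-${i + 1}`,
      trade: trades[Math.floor(random() * trades.length)],
      estimatedCostUsd: Math.floor(random() * 10000),
    });
  }
  return orders;
}

// Determinism check: two runs with the same seed must serialize identically.
const firstRun = JSON.stringify(generateCorpus(42, 600));
const secondRun = JSON.stringify(generateCorpus(42, 600));
console.log(firstRun === secondRun); // true
```

The same pattern extends to agent output: serialize the results of two runs and compare them byte for byte.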
4. What assumptions were made?
- Inbound text resembles short maintenance descriptions with occasional zone,
cost, and priority cues.
- A single trade per order is the normal case; multi-trade requests are
exceptions.
- A $5,000 auto-approval cap and 24h duplicate window are reasonable defaults
(now environment-configurable).
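The assumptions above can be sketched as a small policy module. The environment variable names (`AUTO_APPROVAL_CAP_USD`, `DUPLICATE_WINDOW_HOURS`), the order fields, and the rule that a cost exactly at the cap is still auto-approvable are all assumptions made for illustration; the project's actual names and boundary behavior are not given in this document.

```javascript
// Reads policy defaults from the environment, falling back to the
// documented defaults ($5,000 cap, 24h duplicate window).
function loadPolicy(env = process.env) {
  const cap = Number(env.AUTO_APPROVAL_CAP_USD ?? 5000);
  const windowHours = Number(env.DUPLICATE_WINDOW_HOURS ?? 24);
  if (!Number.isFinite(cap) || cap < 0) {
    throw new Error("AUTO_APPROVAL_CAP_USD must be a non-negative number");
  }
  if (!Number.isFinite(windowHours) || windowHours < 0) {
    throw new Error("DUPLICATE_WINDOW_HOURS must be a non-negative number");
  }
  return {
    autoApprovalCapUsd: cap,
    duplicateWindowMs: windowHours * 60 * 60 * 1000,
  };
}

// An order is treated as an exception when it exceeds the cap, names more
// than one trade (or none), or repeats a recent order within the window.
function needsHumanReview(order, recentOrders, policy) {
  const overCap = order.estimatedCostUsd > policy.autoApprovalCapUsd;
  const notSingleTrade = order.trades.length !== 1;
  const duplicate = recentOrders.some(
    (previous) =>
      previous.fingerprint === order.fingerprint &&
      Math.abs(order.receivedAtMs - previous.receivedAtMs) <=
        policy.duplicateWindowMs
  );
  return overCap || notSingleTrade || duplicate;
}

const policy = loadPolicy({}); // empty env: documented defaults apply
const order = {
  fingerprint: "zone-3:leaking-valve",
  trades: ["plumbing"],
  estimatedCostUsd: 1200,
  receivedAtMs: 0,
};
console.log(needsHumanReview(order, [], policy)); // false
```

Passing the environment as an argument keeps the policy testable without mutating `process.env`.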
5. What limitations exist?
See proof/LIMITATIONS.md. Headline: inbound data is synthetic (the blocking gap for PRODUCTION_VALIDATED); Oracle is not implemented (Postgres is); the LLM classifier is a deterministic stand-in; auth, RBAC, TLS, and security testing are all absent; the deployment is single-host only.
6. What seams exist?
Synthetic inbound data; Oracle adapter (absent; Postgres used); LLM classifier (deterministic stand-in); security/auth/TLS (absent); React console (client-side reimplementation).
7. What was actually executed?
docker compose up -d --build started a PostgreSQL 16 server and a Go gRPC service (built from proto/dispatch.proto). node verify.mjs then processed 600 synthetic orders end-to-end against those external services across the wire. It exercised the HTTP API, tested idempotent replay and malformed-request rejection on the Go server, restarted both containers to prove reconnect and durability, and confirmed determinism. All 23 checks passed.
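The two dispatch properties exercised above, idempotent replay and malformed-request rejection, can be illustrated with an in-memory stand-in. This is not the Go gRPC service: `DispatchStub`, its field names, and the use of `orderId` as the idempotency key are assumptions for the sketch. `INVALID_ARGUMENT` is the standard gRPC status for a malformed request, though this document does not say which status the Go server returns.

```javascript
// In-memory stand-in for a dispatch service, keyed by orderId.
class DispatchStub {
  constructor() {
    this.results = new Map(); // orderId -> first dispatch result
    this.dispatchCount = 0; // number of real (non-replayed) dispatches
  }

  dispatch(request) {
    const valid =
      request &&
      typeof request.orderId === "string" &&
      request.orderId !== "" &&
      typeof request.trade === "string" &&
      request.trade !== "";
    if (!valid) {
      throw new Error("INVALID_ARGUMENT: orderId and trade are required");
    }

    // Replay: return the stored result without dispatching again.
    const previous = this.results.get(request.orderId);
    if (previous) {
      return { ...previous, replayed: true };
    }

    this.dispatchCount += 1;
    const result = {
      dispatchId: `D-${this.dispatchCount}`,
      orderId: request.orderId,
    };
    this.results.set(request.orderId, result);
    return { ...result, replayed: false };
  }
}

const stub = new DispatchStub();
const first = stub.dispatch({ orderId: "WO-1", trade: "plumbing" });
const replay = stub.dispatch({ orderId: "WO-1", trade: "plumbing" });
console.log(first.dispatchId === replay.dispatchId); // true
console.log(stub.dispatchCount); // 1
```

A check in the spirit of the run described above would send the same order twice, assert a single side effect, then send a request with a missing field and assert that it is rejected.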
8. What was inferred?
No metric was inferred: every number is computed from the live run. Real-world accuracy is not extrapolated from synthetic accuracy; it is left open as a seam.
9. What remains unverified?
Accuracy on real Safeguard work orders (no official benchmark); independent third-party reproduction; external validation; an Oracle deployment; the LLM classifier; security/auth/TLS; multi-node HA/load. These are exactly the items required for PRODUCTION_VALIDATED and customer deployment (see LIMITATIONS).
10. What evidence would invalidate this claim?
A failing check in verify.mjs; a checksum mismatch (node tools/forge-proof-verify.mjs --outcome delivery-package/work-order-agents); a non-deterministic re-run; a DB row-count mismatch; a failed reconnect/durability assertion; or any claim lacking a source in CLAIM_EVIDENCE.json.