Safeguard Work-Order Agent Ecosystem
Auditor Challenge — work-order-agents
A hostile external auditor is attempting to invalidate this outcome. Every major claim must survive the following interrogation, answered from objective evidence.
- Standard: IRS_AUDITOR (assume bad faith; trust nothing without evidence)
- Certification state: CERTIFIED
- Evidence Grade: B
- Trust Score: 80/100
- Verification: PASS (23/23)
Global challenge questions
- What evidence supports this? Every metric maps to `proof/CLAIM_EVIDENCE.json` → `proof/evidence/verification-report.json`, produced by `node verify.mjs` and traced in `proof/EXECUTION_TRACE.json`.
- What assumptions exist? See `proof/LIMITATIONS.md` and `proof/EXECUTIVE_EVIDENCE.md`.
- How could this fail? Verification passes today; failure modes are the disclosed seams below.
- Could another engineer reproduce it? Yes: `proof/REPRODUCE.md` lists exact commands; checksums in `proof/CHECKSUMS.json` pin every input.
- What would invalidate this conclusion? A failing check, a checksum mismatch (`node tools/forge-proof-verify.mjs --outcome delivery-package/work-order-agents`), or any claim without a source in CLAIM_EVIDENCE.json.
- Has anything been simulated? Yes: results use a synthetic/internal benchmark (DISCLOSED_SEAM).
- Were any shortcuts taken? 5 disclosed seam(s); 0 draft doc(s); 0 unguarded marketing phrase(s).
- Would this survive expert review? The Proof Layer audit passed with no open objections.
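Two of the invalidation conditions listed above (a failing check, and a claim without a source) lend themselves to a mechanical test. The following is a minimal sketch only: the real schemas of `CLAIM_EVIDENCE.json` and `verification-report.json` are not shown in this document, so the field names `id`, `source`, `checkId`, and `pass`, and the function name `findObjections`, are all assumptions made for illustration.

```javascript
// Hypothetical shapes (not taken from the real proof files):
//   claims: [{ id, source, checkId }]
//   report: { checks: [{ id, pass }] }
function findObjections(claims, report) {
  const objections = [];
  // Index checks by id so each claim can be matched to its evidence.
  const checksById = new Map(report.checks.map((check) => [check.id, check]));

  for (const claim of claims) {
    // A claim with no source cannot be traced to evidence at all.
    if (!claim.source) {
      objections.push(`${claim.id}: no source`);
      continue;
    }
    const check = checksById.get(claim.checkId);
    if (!check) {
      objections.push(`${claim.id}: no matching check`);
    } else if (!check.pass) {
      objections.push(`${claim.id}: check failed`);
    }
  }
  return objections;
}
```

Under these assumptions, an empty return value corresponds to the "no open objections" state, and any returned entry would have to be resolved or disclosed.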
Per-claim challenge
- HTTP API: POST /work-orders ingests, persists, and is readable via GET (+audit) = `health.ok, action=AUTO_DISPATCH, audit=1`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Classifier: category accuracy >= 0.90 on resolvable orders = `accuracy=0.9887 over 530 orders`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Classifier: priority accuracy >= 0.75 = `accuracy=0.9642`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Router: region resolved correctly >= 0.99 where a zone exists = `accuracy=1`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Safety: exception-detection recall >= 0.95 = `recall=1 (tp=221, fn=0)`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Safety: false-auto-action rate <= 0.02 = `rate=0 (0/600)`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Quality: exception-detection precision >= 0.90 = `precision=0.9736 (fp=6)`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Outcome: automatic-action rate >= 0.55 = `autoActionRate=0.6217 (373/600)`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Validator: 100% of missing-location orders blocked from auto-dispatch = `42/42 blocked`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Validator: duplicate resubmissions detected via durable fingerprint query = `61/61 caught`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- Validator: over-cost-limit orders held for human approval = `48/48 held`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
- External Go gRPC: dispatch service reachable over the wire (Health RPC) = `health.ok=true @ 127.0.0.1:50051`; source: `verification-report.json#/checks`; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
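The safety, quality, and outcome figures above can be cross-checked by hand from the reported counts. The sketch below recomputes them and compares each against its stated threshold. It assumes the report rounds to four decimal places. The names `counts`, `round4`, and `evaluate` are illustrative and do not come from `verify.mjs`.

```javascript
// Counts copied from the per-claim results above.
const counts = { tp: 221, fp: 6, fn: 0, autoActions: 373, falseAutoActions: 0, total: 600 };

// Round to four decimal places (assumed to match the report's formatting).
const round4 = (x) => Math.round(x * 1e4) / 1e4;

const metrics = {
  precision: round4(counts.tp / (counts.tp + counts.fp)),          // tp / (tp + fp)
  recall: round4(counts.tp / (counts.tp + counts.fn)),             // tp / (tp + fn)
  autoActionRate: round4(counts.autoActions / counts.total),
  falseAutoActionRate: round4(counts.falseAutoActions / counts.total),
};

// Thresholds as stated in the claims: "min" means >=, "max" means <=.
const thresholds = {
  precision: { min: 0.90 },
  recall: { min: 0.95 },
  autoActionRate: { min: 0.55 },
  falseAutoActionRate: { max: 0.02 },
};

// Compare every metric against its threshold and report pass/fail per metric.
function evaluate(metricValues, limits) {
  return Object.entries(limits).map(([name, limit]) => {
    const value = metricValues[name];
    const pass =
      (limit.min === undefined || value >= limit.min) &&
      (limit.max === undefined || value <= limit.max);
    return { name, value, pass };
  });
}

console.log(evaluate(metrics, thresholds));
```

The recomputed values agree with the reported ones: 221/227 rounds to 0.9736 and 373/600 rounds to 0.6217.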
Open objections (must be resolved or disclosed before CERTIFIED)
- None. All challenged claims are supported by evidence.
Disclosed seams (auditor-acknowledged limitations)
- SIMULATED INPUT: Inbound work orders are synthetic and seeded (src/synth.mjs) with ground-truth labels. Reported accuracy is against that synthetic answer key, not Safeguard production data; absolute accuracy on real text will differ. This is the blocking gap for PRODUCTION_VALIDATED (no official/real benchmark, no independent reproduction, no external validation).
- DISCLOSED_SEAM: Persistence targets PostgreSQL. The Oracle path (named in the brief) is not implemented; an Oracle adapter behind the same repository interface would be required for an Oracle deployment.
- DISCLOSED_SEAM: The classifier is a deterministic lexicon model, NOT a hosted LLM. The production design swaps an LLM behind the same interface (specs/agent-classifier.md); that swap is unverified here.
- DISCLOSED_SEAM: No identity/auth, RBAC, tenant isolation, TLS, or security/compliance testing was performed on the HTTP/gRPC surfaces.
- DISCLOSED_SEAM: The React console (public/console.html) reimplements the agent heuristics client-side for demonstration; the verified system of record is the Node pipeline under src/.
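The classifier seam above describes swapping an LLM in behind the same interface as the deterministic lexicon model. As a rough illustration of what such an interface boundary could look like, the sketch below uses a hypothetical `classify(text)` method, a made-up two-category lexicon, and a hypothetical `triage` caller. None of these names or keywords are taken from `specs/agent-classifier.md` or the code under `src/`.

```javascript
// Hypothetical lexicon: the project's real keyword lists are not shown in this document.
const LEXICON = {
  plumbing: ["leak", "pipe", "drain"],
  electrical: ["outlet", "breaker", "wiring"],
};

// Deterministic lexicon classifier: counts keyword hits per category
// and returns the category with the most hits ("unknown" if none match).
function createLexiconClassifier(lexicon) {
  return {
    classify(text) {
      const words = text.toLowerCase().split(/\W+/);
      let best = { category: "unknown", hits: 0 };
      for (const [category, keywords] of Object.entries(lexicon)) {
        const hits = words.filter((word) => keywords.includes(word)).length;
        if (hits > best.hits) best = { category, hits };
      }
      return { category: best.category };
    },
  };
}

// The caller depends only on the classify(text) method, so any object
// exposing that method (for example, an LLM-backed one) could be passed in.
function triage(order, classifier) {
  const { category } = classifier.classify(order.description);
  return { id: order.id, category };
}
```

In this sketch, `triage(order, createLexiconClassifier(LEXICON))` and `triage(order, someOtherClassifier)` share one call site, which is the property the seam relies on. Whether a swapped-in model preserves the reported accuracy remains unverified, as the seam states.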
_Generated by tools/forge-proof.mjs at 2026-06-25T23:11:14.035Z. The Proof Layer has final authority over this challenge; it may not be edited to suppress objections._