Safeguard Work-Order Agent Ecosystem

Auditor Challenge — work-order-agents

A hostile external auditor is attempting to invalidate this outcome. Every major claim must survive the following interrogation, answered from objective evidence.

  • Standard: IRS_AUDITOR (assume bad faith; trust nothing without evidence)
  • Certification state: CERTIFIED
  • Evidence Grade: B
  • Trust Score: 80/100
  • Verification: PASS (23/23)

Global challenge questions

  1. What evidence supports this? Every metric maps through proof/CLAIM_EVIDENCE.json to proof/evidence/verification-report.json, which is produced by node verify.mjs and traced in proof/EXECUTION_TRACE.json.
  2. What assumptions exist? See proof/LIMITATIONS.md and proof/EXECUTIVE_EVIDENCE.md.
  3. How could this fail? Verification passes today; failure modes are the disclosed seams below.
  4. Could another engineer reproduce it? Yes — proof/REPRODUCE.md lists exact commands; checksums in proof/CHECKSUMS.json pin every input.
  5. What would invalidate this conclusion? A failing check, a checksum mismatch (node tools/forge-proof-verify.mjs --outcome delivery-package/work-order-agents), or any claim without a source in CLAIM_EVIDENCE.json.
  6. Has anything been simulated? Yes — results use a synthetic/internal benchmark (DISCLOSED_SEAM).
  7. Were any shortcuts taken? 5 disclosed seams; 0 draft docs; 0 unguarded marketing phrases.
  8. Would this survive expert review? The Proof Layer audit passed with no open objections.
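
The checksum check named in questions 4 and 5 can be sketched as follows. This is a minimal illustration, not the logic of tools/forge-proof-verify.mjs: it assumes the manifest is a flat object mapping file paths to SHA-256 hex digests, which may not match the real layout of proof/CHECKSUMS.json. The names `sha256Hex` and `findChecksumMismatches` are hypothetical, and the block uses CommonJS `require` so that it runs standalone.

```javascript
const { createHash } = require("node:crypto");

// Hex-encoded SHA-256 of a string or Buffer.
function sha256Hex(data) {
  return createHash("sha256").update(data).digest("hex");
}

// ASSUMPTION: manifest has the shape { "<path>": "<sha256 hex>" }.
// `read` is injected (path -> string | Buffer) so the check can be
// exercised without touching the file system.
function findChecksumMismatches(manifest, read) {
  const mismatches = [];
  for (const [path, expected] of Object.entries(manifest)) {
    const actual = sha256Hex(read(path));
    if (actual !== expected.toLowerCase()) {
      mismatches.push({ path, expected, actual });
    }
  }
  return mismatches;
}
```

Under those assumptions, an empty result means every pinned input is unchanged, and any returned entry would count as an invalidating mismatch. Against real files, `read` could be `(p) => require("node:fs").readFileSync(p)`.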

Per-claim challenge

  • HTTP API: POST /work-orders ingests, persists, and is readable via GET (+audit) = health.ok, action=AUTO_DISPATCH, audit=1 — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Classifier: category accuracy >= 0.90 on resolvable orders = accuracy=0.9887 over 530 orders — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Classifier: priority accuracy >= 0.75 = accuracy=0.9642 — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Router: region resolved correctly >= 0.99 where a zone exists = accuracy=1 — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Safety: exception-detection recall >= 0.95 = recall=1 (tp=221, fn=0) — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Safety: false-auto-action rate <= 0.02 = rate=0 (0/600) — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Quality: exception-detection precision >= 0.90 = precision=0.9736 (fp=6) — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Outcome: automatic-action rate >= 0.55 = autoActionRate=0.6217 (373/600) — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Validator: 100% of missing-location orders blocked from auto-dispatch = 42/42 blocked — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Validator: duplicate resubmissions detected via durable fingerprint query = 61/61 caught — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • Validator: over-cost-limit orders held for human approval = 48/48 held — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
  • External Go gRPC: dispatch service reachable over the wire (Health RPC) = health.ok=true @ 127.0.0.1:50051 — source: verification-report.json#/checks; status: SUPPORTED. _Could another engineer reproduce this number from node verify.mjs? Yes, deterministically._
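
The ratio claims above can be recomputed from the counts they report. The sketch below is an independent arithmetic check, not code from verify.mjs. It assumes the tp=221 reported with recall is the same true-positive count behind the precision figure, and the helper names are illustrative.

```javascript
// Round to four decimal places, matching the precision used in the report.
function round4(x) {
  return Number(x.toFixed(4));
}

// Counts taken from the per-claim list (tp and fn from recall, fp from precision).
const tp = 221, fn = 0, fp = 6;

const metrics = {
  recall: round4(tp / (tp + fn)),      // threshold: >= 0.95
  precision: round4(tp / (tp + fp)),   // threshold: >= 0.90
  autoActionRate: round4(373 / 600),   // threshold: >= 0.55
  falseAutoActionRate: round4(0 / 600) // threshold: <= 0.02
};

// Compare each recomputed value against its stated threshold.
const passes = {
  recall: metrics.recall >= 0.95,
  precision: metrics.precision >= 0.90,
  autoActionRate: metrics.autoActionRate >= 0.55,
  falseAutoActionRate: metrics.falseAutoActionRate <= 0.02
};

console.log(metrics, passes);
```

Under that assumption, the recomputed values are recall 1, precision 0.9736, autoActionRate 0.6217, and falseAutoActionRate 0, which agree with the reported figures.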

Open objections (must be resolved or disclosed before CERTIFIED)

  • None. All challenged claims are supported by evidence.

Disclosed seams (auditor-acknowledged limitations)

  • SIMULATED INPUT: Inbound work orders are synthetic and seeded (src/synth.mjs) with ground-truth labels. Reported accuracy is against that synthetic answer key, not Safeguard production data; absolute accuracy on real text will differ. This is the blocking gap for PRODUCTION_VALIDATED (no official/real benchmark, no independent reproduction, no external validation).
  • DISCLOSED_SEAM: Persistence targets PostgreSQL. The Oracle path (named in the brief) is not implemented; an Oracle adapter behind the same repository interface would be required for an Oracle deployment.
  • DISCLOSED_SEAM: The classifier is a deterministic lexicon model, NOT a hosted LLM. The production design swaps an LLM behind the same interface (specs/agent-classifier.md); that swap is unverified here.
  • DISCLOSED_SEAM: No identity/auth, RBAC, tenant isolation, TLS, or security/compliance testing was performed on the HTTP/gRPC surfaces.
  • DISCLOSED_SEAM: The React console (public/console.html) reimplements the agent heuristics client-side for demonstration; the verified system of record is the Node pipeline under src/.
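
To make the classifier seam concrete, the sketch below shows what a deterministic lexicon model behind a small interface might look like. Everything here is hypothetical: the `classify(text)` shape, the categories, and the keyword lists are illustrative and are not taken from specs/agent-classifier.md or src/. An LLM-backed replacement would likely need an asynchronous version of the same method.

```javascript
// HYPOTHETICAL lexicon: category -> keywords. Not the project's word list.
const LEXICON = {
  plumbing: ["leak", "pipe", "drain"],
  electrical: ["outlet", "breaker", "wiring"],
};

// Assumed interface: any object exposing classify(text) -> { category, score }.
function makeLexiconClassifier(lexicon) {
  return {
    classify(text) {
      // Lowercase and split into alphabetic tokens.
      const tokens = text.toLowerCase().match(/[a-z]+/g) ?? [];
      let best = { category: "unknown", score: 0 };
      for (const [category, keywords] of Object.entries(lexicon)) {
        // Score is the number of tokens that appear in the category's keywords.
        const score = tokens.filter((t) => keywords.includes(t)).length;
        // Strict comparison: on a tie, the category listed first wins.
        if (score > best.score) best = { category, score };
      }
      return best;
    },
  };
}

const classifier = makeLexiconClassifier(LEXICON);
console.log(classifier.classify("Leak under the sink, pipe burst"));
```

Because the result depends only on the input text and a fixed lexicon, repeated runs give identical labels. That kind of determinism is what lets a verification run be reproduced exactly, and it is the property a hosted-LLM swap would need to be re-verified against.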

_Generated by tools/forge-proof.mjs at 2026-06-25T23:11:14.035Z. The Proof Layer has final authority over this challenge; it may not be edited to suppress objections._