Safeguard Work-Order Agent Ecosystem

Executive Evidence

Executive Evidence — Work-Order Agent Ecosystem

PROOF_STANDARD = IRS_AUDITOR. This document answers the ten mandatory questions with objective evidence. Every number traces through proof/CLAIM_EVIDENCE.json → proof/evidence/verification-report.json, produced by node verify.mjs against external live services and recorded in proof/EXECUTION_TRACE.json.

1. What exactly is being claimed?

A four-agent pipeline (classify → route → validate → action) processes work orders end-to-end over external live infrastructure — a separate PostgreSQL server (over TCP) and a separate Go gRPC dispatch service (over the wire) — and, on a synthetic labeled corpus of 600 orders:

  • Classification accuracy 98.87%; priority accuracy 96.42%; region-routing accuracy 100%.
  • Exception-detection recall 1.0, precision 0.974.
  • False-auto-action rate 0.0% (0 / 600).
  • Automatic-action rate 62.17%, with human-in-the-loop reduced to ~37.8%.
  • Every order persisted to the external Postgres; one append-only audit row each; dispatch idempotent across the gRPC wire; malformed requests rejected by the Go server; the client reconnects after the Go service restarts; data survives an external Postgres restart; the HTTP API ingests/persists/serves orders.
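
To make the metric names above concrete, here is a minimal sketch of how such rates could be computed from labeled outcomes. It is written in Python for illustration only; the Outcome record, its field names, and the reading of "false auto-action" as "an order that needed human review but was actioned automatically" are assumptions, not definitions taken from verify.mjs. The toy data is invented and is not the 600-order corpus.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    # Hypothetical record shape; the real report schema is not shown in this document.
    is_exception: bool   # ground-truth label: order needs human review
    flagged: bool        # pipeline flagged the order as an exception
    auto_actioned: bool  # pipeline acted without a human


def summarize(outcomes):
    """Compute exception precision/recall and the two action rates."""
    tp = sum(o.is_exception and o.flagged for o in outcomes)
    fp = sum((not o.is_exception) and o.flagged for o in outcomes)
    fn = sum(o.is_exception and (not o.flagged) for o in outcomes)
    # Assumed definition: a true exception that was nevertheless auto-actioned.
    false_auto = sum(o.is_exception and o.auto_actioned for o in outcomes)
    auto = sum(o.auto_actioned for o in outcomes)
    n = len(outcomes)
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "false_auto_action_rate": false_auto / n,
        "auto_action_rate": auto / n,
    }


# Toy data for illustration only.
toy = [
    Outcome(is_exception=True, flagged=True, auto_actioned=False),
    Outcome(is_exception=True, flagged=True, auto_actioned=False),
    Outcome(is_exception=False, flagged=True, auto_actioned=False),  # false alarm
    Outcome(is_exception=False, flagged=False, auto_actioned=True),
    Outcome(is_exception=False, flagged=False, auto_actioned=True),
]
print(summarize(toy))
```

Under these definitions, recall 1.0 with precision below 1.0 means no true exception was missed, while some non-exceptions were sent to a human anyway. If the automatic-action rate is taken over all 600 orders, 62.17% corresponds to 373 orders (373 / 600 ≈ 0.6217).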

2. What evidence supports each claim?

verify.mjs runs 23 checks against the external stack and writes verification-report.json (copied to proof/evidence/). Each check maps to a claim in CLAIM_EVIDENCE.json. Raw run output is in proof/evidence/verify.log.
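
The claim-to-check mapping can be audited mechanically. The sketch below, in Python for illustration, assumes a hypothetical shape for both files (a "claims" list whose entries name check ids, and a "checks" list with a "passed" flag); the actual schemas of CLAIM_EVIDENCE.json and verification-report.json are not shown in this document, and the ids are invented.

```python
def untraced_claims(claim_evidence, report):
    """Return ids of claims that cite no check, or cite a missing or failing check."""
    status = {check["id"]: check["passed"] for check in report["checks"]}
    bad = []
    for claim in claim_evidence["claims"]:
        cited = claim.get("checks", [])
        if not cited or not all(status.get(check_id, False) for check_id in cited):
            bad.append(claim["id"])
    return bad


# Hypothetical, in-memory stand-ins for the two JSON files.
claim_evidence = {
    "claims": [
        {"id": "false_auto_action_rate", "checks": ["check_07"]},
        {"id": "dispatch_idempotent", "checks": ["check_12"]},
        {"id": "orphan_claim", "checks": []},
    ]
}
report = {
    "checks": [
        {"id": "check_07", "passed": True},
        {"id": "check_12", "passed": False},
    ]
}
print(untraced_claims(claim_evidence, report))
```

An empty result is the state this section asserts: every claim cites at least one check, and every cited check passed.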

3. Can an independent engineer reproduce this claim?

Yes. proof/REPRODUCE.md gives exact commands (docker compose up -d --build, then DATABASE_URL=… DISPATCH_GRPC_URL=… node verify.mjs). Agent logic is deterministic (seed 42); live DB row counts, resilience, and durability are asserted at run time; proof/CHECKSUMS.json pins every input. Note: this is reproducibility of the build, not an independent third-party reproduction of results on real data (see Q9).
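
As an illustration of what pinning inputs by checksum involves, the following Python sketch compares SHA-256 digests of files against a pinned map. The flat {relative path: digest} layout is an assumption; the real format of proof/CHECKSUMS.json is not shown here, and the file names are invented.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatches(root, pinned):
    """Return pinned paths that are missing or whose digest differs."""
    root = Path(root)
    out = []
    for rel, expected in pinned.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            out.append(rel)
    return sorted(out)


# Demonstration in a temporary directory with an invented input file.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.json"
    corpus.write_text(json.dumps({"seed": 42}))
    pinned = {"corpus.json": sha256_of(corpus), "missing.txt": "0" * 64}
    before = mismatches(tmp, pinned)          # only the absent file is reported
    corpus.write_text(json.dumps({"seed": 43}))
    after = mismatches(tmp, pinned)           # the edited file is now reported too
print(before, after)
```

Any non-empty result means an input differs from the pinned version, so a re-run is no longer a reproduction of the recorded one.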

4. What assumptions were made?

  • Inbound text resembles short maintenance descriptions with occasional zone, cost, and priority cues.
  • A single trade per order is the normal case; multi-trade requests are exceptions.
  • A $5,000 auto-approval cap and 24h duplicate window are reasonable defaults (now environment-configurable).
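
The last assumption can be sketched as follows. This is a Python illustration under stated assumptions: the environment variable names AUTO_APPROVE_CAP_USD and DUPLICATE_WINDOW_HOURS are hypothetical, the cap is treated as inclusive, and "duplicate" is taken to mean same site and trade within the window. None of these details are confirmed by this document.

```python
import os
from datetime import datetime, timedelta


def load_limits(env=os.environ):
    """Read the cap and window, falling back to the documented defaults."""
    # Variable names are hypothetical; only the defaults come from the document.
    cap = float(env.get("AUTO_APPROVE_CAP_USD", "5000"))
    window = timedelta(hours=float(env.get("DUPLICATE_WINDOW_HOURS", "24")))
    return cap, window


def may_auto_approve(cost_usd, cap):
    """Assumed rule: costs up to and including the cap may be auto-approved."""
    return cost_usd <= cap


def is_duplicate(order, earlier, window):
    """Assumed rule: same site and trade, received within the window of an earlier order."""
    return any(
        prev["site"] == order["site"]
        and prev["trade"] == order["trade"]
        and timedelta(0) <= order["received"] - prev["received"] <= window
        for prev in earlier
    )


cap, window = load_limits({})  # empty mapping, so the defaults apply
first = {"site": "zone-3", "trade": "hvac", "received": datetime(2024, 1, 1, 9, 0)}
repeat = {"site": "zone-3", "trade": "hvac", "received": datetime(2024, 1, 1, 20, 0)}
late = {"site": "zone-3", "trade": "hvac", "received": datetime(2024, 1, 3, 9, 0)}
print(may_auto_approve(4200.0, cap), may_auto_approve(7500.0, cap))
print(is_duplicate(repeat, [first], window), is_duplicate(late, [first], window))
```

Passing the environment as a parameter keeps the thresholds testable without mutating process state.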

5. What limitations exist?

See proof/LIMITATIONS.md. Headline: inbound data is synthetic (the blocking gap for PRODUCTION_VALIDATED); Oracle is not implemented (Postgres is); the LLM classifier is a deterministic stand-in; there is no auth/RBAC/TLS/security testing; single-host only.

6. What seams exist?

Synthetic inbound data; Oracle adapter (absent; Postgres used); LLM classifier (deterministic stand-in); security/auth/TLS (absent); React console (client-side reimplementation).

7. What was actually executed?

docker compose up -d --build started a PostgreSQL 16 server and a Go gRPC service (built from proto/dispatch.proto). node verify.mjs then processed 600 synthetic orders end-to-end against those external services across the wire, exercised the HTTP API, idempotent replay and malformed rejection on the Go server, restarted both containers to prove reconnect + durability, and confirmed determinism — 23 checks, all passing.
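
The two dispatch properties exercised here, idempotent replay and malformed-request rejection, can be illustrated with an in-memory stand-in. The real service is a Go gRPC server defined by proto/dispatch.proto; the Python class below, its field names, and its id format are hypothetical and only show the behavior being asserted, not the wire protocol.

```python
class Dispatcher:
    """In-memory stand-in for a dispatch service (illustration only)."""

    def __init__(self):
        self._by_order = {}  # order_id -> first dispatch result

    def dispatch(self, request):
        order_id = request.get("order_id")
        region = request.get("region")
        # Malformed requests are rejected before any state changes.
        if not order_id or not region:
            raise ValueError("malformed dispatch request")
        # Replay of a known order returns the stored result; nothing new is created.
        if order_id in self._by_order:
            return self._by_order[order_id]
        result = {
            "dispatch_id": f"D-{len(self._by_order) + 1}",
            "order_id": order_id,
            "region": region,
        }
        self._by_order[order_id] = result
        return result

    def count(self):
        return len(self._by_order)


svc = Dispatcher()
first = svc.dispatch({"order_id": "WO-1", "region": "north"})
replay = svc.dispatch({"order_id": "WO-1", "region": "north"})
try:
    svc.dispatch({"order_id": "WO-2"})  # missing region
    rejected = False
except ValueError:
    rejected = True
print(first == replay, svc.count(), rejected)
```

Surviving a service restart additionally requires the order-to-dispatch mapping to live in durable storage, which this sketch does not model.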

8. What was inferred?

No metric is inferred: every number is computed from the live run. Real-world accuracy is not extrapolated from synthetic accuracy; it is left open as a seam.

9. What remains unverified?

Accuracy on real Safeguard work orders (no official benchmark); independent third-party reproduction; external validation; an Oracle deployment; the LLM classifier; security/auth/TLS; multi-node HA/load. These are exactly the items required for PRODUCTION_VALIDATED and customer deployment (see proof/LIMITATIONS.md).

10. What evidence would invalidate this claim?

A failing check in verify.mjs; a checksum mismatch (node tools/forge-proof-verify.mjs --outcome delivery-package/work-order-agents); a non-deterministic re-run; a DB row-count mismatch; a failed reconnect/durability assertion; or any claim lacking a source in CLAIM_EVIDENCE.json.
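
One of these conditions, a non-deterministic re-run, can be detected by comparing digests of two runs with the same seed. The Python sketch below uses an invented stand-in for the pipeline; only the seed value 42 comes from this document.

```python
import hashlib
import json
import random


def run(seed):
    """Invented stand-in for a seeded pipeline run."""
    rng = random.Random(seed)
    trades = ["hvac", "plumbing", "electrical"]
    return [{"order": i, "trade": rng.choice(trades)} for i in range(5)]


def digest(results):
    """Stable digest of a run: canonical JSON, then SHA-256."""
    canonical = json.dumps(results, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


first_run = digest(run(42))
second_run = digest(run(42))
print(first_run == second_run)
```

Differing digests for the same seed and the same pinned inputs would be evidence against the determinism claim.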