Outcome Contract — Safeguard Work-Order Agent Ecosystem
Customer request (verbatim intent)
Build an AI agent ecosystem that automates processing of received work orders end-to-end, reducing human-in-the-loop intervention to exception handling only. Agents must classify, route, validate, and action work orders, and tie React business applications into existing back-end systems (Go/gRPC services, Oracle/Postgres). Operate inside a mature, partially instrumented environment.
Scope delivered
- A four-agent pipeline — classify → route → validate → action — that
processes a stream of work orders end-to-end and computes the automatic-action vs. human-exception split.
- Deterministic, auditable agents with explicit confidence and reasoning.
- A safety boundary (validator) that holds anything uncertain for a human.
- External live infrastructure: a separate PostgreSQL 16 server (over TCP)
and a separate Go gRPC dispatch service (over the wire, proto/dispatch.proto, dispatch-service/), plus a real HTTP ingest API (server.mjs) — all wired by docker-compose.yml.
- Environment-based configuration (DATABASE_URL, DISPATCH_GRPC_URL,
confidence floors, cost caps, SLA policy; see src/config.mjs).
- An idempotent dispatch boundary (verified across the wire), an append-only
audit trail (verified in the DB), and verified restart/reconnect resilience.
- The governing markdown specs for each agent (specs/).
- A React work-order console (public/console.html) demonstrating the
business-app tie-in.
- A full IRS_AUDITOR proof package.
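The validator boundary and the idempotent dispatch boundary listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the environment variable CONFIDENCE_FLOOR, the default of 0.85, the order fields (id, confidence, region), the disposition names, and the in-memory store are all hypothetical stand-ins for src/config.mjs, the PostgreSQL audit table, and the gRPC dispatch service.

```javascript
// Hypothetical sketch: every name here is illustrative, not taken from the repo.
// Assumed env var; the real configuration lives in src/config.mjs.
const CONFIDENCE_FLOOR = Number(process.env.CONFIDENCE_FLOOR ?? "0.85");

// Safety boundary: anything uncertain or incomplete is held for a human.
// The negated comparison fails closed when confidence is missing or NaN.
function validate(order, floor = CONFIDENCE_FLOOR) {
  const reasons = [];
  if (!(order.confidence >= floor)) {
    reasons.push(`confidence ${order.confidence} is below floor ${floor}`);
  }
  if (!order.region) {
    reasons.push("region is missing");
  }
  return {
    disposition: reasons.length === 0 ? "AUTO_ACTION" : "HOLD_FOR_HUMAN",
    reasons,
  };
}

// Idempotent dispatch with an append-only audit trail (in-memory stand-in).
function makeDispatcher() {
  const dispatched = new Map(); // idempotency key (order id) -> dispatch id
  const audit = [];             // entries are only ever appended
  let nextId = 1;
  return {
    dispatch(order) {
      if (dispatched.has(order.id)) {
        // A repeat call returns the original result and performs no new action.
        audit.push({ event: "DUPLICATE_IGNORED", orderId: order.id });
        return { dispatchId: dispatched.get(order.id), duplicate: true };
      }
      const dispatchId = `D-${nextId++}`;
      dispatched.set(order.id, dispatchId);
      audit.push({ event: "DISPATCHED", orderId: order.id, dispatchId });
      return { dispatchId, duplicate: false };
    },
    // Callers receive copies, so they cannot rewrite history.
    auditTrail: () => audit.map((entry) => ({ ...entry })),
  };
}
```

In this sketch, dispatching the same order id twice yields the same dispatch id and two audit entries, which is the property the across-the-wire idempotency check would be expected to assert.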
Explicitly OUT of scope / disclosed as seams
- Inbound work-order data: all inputs are synthetic and seeded; accuracy is
verified against a synthetic answer key, not Safeguard data. This is the blocking gap for PRODUCTION_VALIDATED (no official benchmark, independent reproduction, or external validation).
- Oracle: persistence is verified against PostgreSQL only; an Oracle adapter
(same repository interface) would be required for an Oracle backend.
- LLM classifier: an LLM is the intended production engine; here a
deterministic lexicon model stands in behind the same interface.
- Security: no authentication, RBAC, tenant isolation, TLS/mTLS, or security testing.
- HA / load: single host; no multi-node, failover, or load testing.
See proof/LIMITATIONS.md for the authoritative seam list.
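To make the classifier seam concrete, here is a minimal sketch of a deterministic lexicon classifier that returns a category, a confidence, and a reasoning string. The interface shape (text in, { category, confidence, reasoning } out), the categories, and the keywords are assumptions for illustration only; the governing specs in specs/ define the real contract. An LLM-backed classifier would be expected to return the same shape.

```javascript
// Hypothetical lexicon: categories and keywords are illustrative only.
const LEXICON = {
  PLUMBING: ["leak", "pipe", "drain"],
  ELECTRICAL: ["outlet", "breaker", "wiring"],
  LANDSCAPING: ["lawn", "mow", "shrub"],
};

// Deterministic classifier: the same text always yields the same result.
function lexiconClassifier(text) {
  const tokens = text.toLowerCase().match(/[a-z]+/g) ?? [];
  const hitsByCategory = {};
  let totalHits = 0;
  for (const [category, keywords] of Object.entries(LEXICON)) {
    hitsByCategory[category] = tokens.filter((t) => keywords.includes(t));
    totalHits += hitsByCategory[category].length;
  }
  if (totalHits === 0) {
    // Zero confidence means a downstream validator would hold the order.
    return { category: "UNKNOWN", confidence: 0, reasoning: "no lexicon keywords matched" };
  }
  // Pick the category with the most hits; ties resolve in lexicon order,
  // which keeps the outcome deterministic.
  let best = null;
  for (const category of Object.keys(LEXICON)) {
    if (best === null || hitsByCategory[category].length > hitsByCategory[best].length) {
      best = category;
    }
  }
  return {
    category: best,
    // Share of all keyword hits that belong to the winning category.
    confidence: hitsByCategory[best].length / totalHits,
    reasoning: `matched keywords: ${hitsByCategory[best].join(", ")}`,
  };
}
```

Because confidence here is the winning category's share of keyword hits, an order that mentions two categories equally scores 0.5 and would fall below any reasonable confidence floor, so it is routed to a human rather than guessed at.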
Success criteria (MUST_PASS)
- Exception-detection recall ≥ 0.95 (never miss an order that needs a human).
- False-auto-action rate ≤ 0.02 (never auto-dispatch an order that needed a
human).
- Automatic-action rate ≥ 0.55 (human-in-the-loop genuinely reduced).
- Classification accuracy ≥ 0.90; region-routing accuracy ≥ 0.99.
- Runs are deterministic (same input, same output); integration-seam
contracts (idempotency + audit) hold; dispositions reconcile to total volume.
All criteria are asserted by verify.mjs and re-run by the Proof Layer.
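As an illustration of how the rate-based criteria above could be computed and asserted, here is a minimal sketch. It is not the contents of verify.mjs. The record shape (needsHuman from the answer key, disposition from the pipeline) is hypothetical, and the denominators are assumptions: recall is taken over orders that needed a human, and the false-auto-action rate over orders that were auto-actioned.

```javascript
// Thresholds copied from the success criteria; the key names are hypothetical.
const THRESHOLDS = {
  minExceptionRecall: 0.95,
  maxFalseAutoActionRate: 0.02,
  minAutoActionRate: 0.55,
};

// results: [{ needsHuman: boolean, disposition: "AUTO_ACTION" | "HOLD_FOR_HUMAN" }]
function computeMetrics(results) {
  const total = results.length;
  const auto = results.filter((r) => r.disposition === "AUTO_ACTION");
  const held = results.filter((r) => r.disposition === "HOLD_FOR_HUMAN");
  const needsHuman = results.filter((r) => r.needsHuman);
  const caught = needsHuman.filter((r) => r.disposition === "HOLD_FOR_HUMAN").length;
  const falseAuto = auto.filter((r) => r.needsHuman).length;
  return {
    // Assumed denominator: all orders that needed a human.
    exceptionRecall: needsHuman.length ? caught / needsHuman.length : 1,
    // Assumed denominator: all auto-actioned orders.
    falseAutoActionRate: auto.length ? falseAuto / auto.length : 0,
    autoActionRate: total ? auto.length / total : 0,
    // Every order must land in exactly one disposition.
    reconciles: auto.length + held.length === total,
  };
}

// Returns the names of failed criteria; an empty array means all passed.
function checkCriteria(metrics, thresholds = THRESHOLDS) {
  const failures = [];
  if (metrics.exceptionRecall < thresholds.minExceptionRecall) failures.push("exception-detection recall");
  if (metrics.falseAutoActionRate > thresholds.maxFalseAutoActionRate) failures.push("false-auto-action rate");
  if (metrics.autoActionRate < thresholds.minAutoActionRate) failures.push("automatic-action rate");
  if (!metrics.reconciles) failures.push("disposition reconciliation");
  return failures;
}
```

Returning the list of failed criteria, rather than a single boolean, lets a verification run report exactly which criterion regressed.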
Definition of done
A stranger can run node verify.mjs, reproduce every number, trace each claim to evidence, and see exactly which components are live vs. disclosed seams.