Safeguard Work-Order Agent Ecosystem

Outcome Contract


Customer request (verbatim intent)

Build an AI agent ecosystem that automates processing of received work orders end-to-end, reducing human-in-the-loop intervention to exception handling only. Agents must classify, route, validate, and action work orders, and tie React business applications into existing back-end systems (Go/gRPC services, Oracle/Postgres). Operate inside a mature, partially instrumented environment.

Scope delivered

A four-agent pipeline (classify → route → validate → action) that processes a stream of work orders end-to-end and computes the automatic-action vs. human-exception split.

Deterministic, auditable agents with explicit confidence and reasoning.
A safety boundary (validator) that holds anything uncertain for a human.
External live infrastructure: a separate PostgreSQL 16 server (over TCP) and a separate Go gRPC dispatch service (over the wire, proto/dispatch.proto, dispatch-service/), plus a real HTTP ingest API (server.mjs), all wired by docker-compose.yml.

Environment-based configuration (DATABASE_URL, DISPATCH_GRPC_URL, confidence floors, cost caps, SLA policy; see src/config.mjs).

An idempotent dispatch boundary (verified across the wire), an append-only audit trail (verified in the DB), and verified restart/reconnect resilience.

The governing markdown specs for each agent (specs/).
A React work-order console (public/console.html) demonstrating the business-app tie-in.

A full IRS_AUDITOR proof package.
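
The pipeline, the validator's safety boundary, and the idempotent dispatch listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: the function names (`classify`, `route`, `validate`, `action`, `processOrder`), the toy keyword lexicon, the 0.8 confidence floor, and the in-memory `dispatched` map and `audit` array are all hypothetical. The delivered system dispatches through the Go gRPC service and persists to PostgreSQL.

```javascript
// Hypothetical sketch: names, lexicon, and thresholds are assumptions,
// not taken from the repository.
const CONFIDENCE_FLOOR = 0.8; // assumed value; real floors are configured in src/config.mjs

// Toy lexicon standing in for the deterministic classifier.
const LEXICON = new Map([
  ["leak", "plumbing"],
  ["outage", "electrical"],
  ["lock", "security"],
]);

// Classify by first keyword match, reporting explicit confidence and reasoning.
function classify(order) {
  const words = order.text.toLowerCase().split(/\W+/);
  const hit = words.find((w) => LEXICON.has(w));
  return hit
    ? { category: LEXICON.get(hit), confidence: 0.95, reasoning: `keyword "${hit}"` }
    : { category: "unknown", confidence: 0.2, reasoning: "no lexicon match" };
}

// Assumed rule: the region is carried on the order; a missing region means no route.
function route(order) {
  return { region: order.region ?? null };
}

// Safety boundary: anything uncertain is held for a human.
function validate(classification, routing) {
  const reasons = [];
  if (classification.confidence < CONFIDENCE_FLOOR) reasons.push("low confidence");
  if (routing.region === null) reasons.push("no region");
  return { ok: reasons.length === 0, reasons };
}

const dispatched = new Map(); // idempotency key -> dispatch record
const audit = []; // append-only: entries are pushed, never edited or removed

// Idempotent action: a repeated order id does not create a second dispatch.
function action(order, classification, routing) {
  const key = order.id; // assumed idempotency key
  if (!dispatched.has(key)) {
    dispatched.set(key, { category: classification.category, region: routing.region });
    audit.push({ orderId: order.id, event: "dispatched" });
  }
  return dispatched.get(key);
}

function processOrder(order) {
  const c = classify(order);
  const r = route(order);
  const v = validate(c, r);
  if (!v.ok) {
    audit.push({ orderId: order.id, event: "held", reasons: v.reasons });
    return { disposition: "human_exception", reasons: v.reasons };
  }
  action(order, c, r);
  return { disposition: "auto_action" };
}

const orders = [
  { id: "wo-1", text: "Water leak in lobby", region: "north" },
  { id: "wo-2", text: "Something odd happened", region: "south" },
  { id: "wo-1", text: "Water leak in lobby", region: "north" }, // duplicate delivery
];
const results = orders.map(processOrder);
const autoCount = results.filter((r) => r.disposition === "auto_action").length;
console.log({ autoCount, exceptions: results.length - autoCount, dispatches: dispatched.size });
```

In this toy run the duplicate delivery of `wo-1` is processed again but produces no second dispatch, and the unmatched order is held with its reasons recorded in the audit list.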

Explicitly OUT of scope / disclosed as seams

Inbound work-order data: all inputs are synthetic and seeded; verified accuracy is against a synthetic answer key, not Safeguard data. This is the blocking gap for PRODUCTION_VALIDATED (no official benchmark / independent reproduction / external validation).

Oracle: persistence is verified against PostgreSQL; an Oracle adapter (same repository interface) would be required for an Oracle backend.

LLM classifier: an LLM is the intended production engine; here a deterministic lexicon model stands in behind the same interface.

Security: no auth/RBAC/tenant isolation/TLS/mTLS or security testing.
HA / load: single host; no multi-node, failover, or load testing.

See proof/LIMITATIONS.md for the authoritative seam list.

Success criteria (MUST_PASS)

Exception-detection recall ≥ 0.95 (never miss an order that needs a human).
False-auto-action rate ≤ 0.02 (never auto-dispatch an order that needed a human).

Automatic-action rate ≥ 0.55 (human-in-the-loop genuinely reduced).
Classification accuracy ≥ 0.90; region-routing accuracy ≥ 0.99.
Deterministic; integration-seam contracts (idempotency + audit) hold; dispositions reconcile to total volume.

All criteria are asserted by verify.mjs and re-run by the Proof Layer.
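
To make the thresholds above concrete, the sketch below shows one plausible way to compute them from per-order results. The record shape (`needsHuman`, `disposition`), the `computeMetrics` and `passes` helpers, the metric denominators, and the sample data are assumptions for illustration; the authoritative definitions are whatever verify.mjs actually asserts.

```javascript
// Hypothetical sketch: field names and metric denominators are assumptions.
function computeMetrics(records) {
  const total = records.length;
  const auto = records.filter((r) => r.disposition === "auto_action");
  const held = records.filter((r) => r.disposition === "human_exception");
  const needHuman = records.filter((r) => r.needsHuman);
  const caught = needHuman.filter((r) => r.disposition === "human_exception");
  const falseAuto = auto.filter((r) => r.needsHuman);
  return {
    // Share of orders needing a human that were actually held.
    exceptionRecall: needHuman.length ? caught.length / needHuman.length : 1,
    // Assumed denominator: auto-actioned orders (it could also be total volume).
    falseAutoActionRate: auto.length ? falseAuto.length / auto.length : 0,
    // Share of all orders handled without a human.
    autoActionRate: total ? auto.length / total : 0,
    // Every order must land in exactly one disposition.
    reconciles: auto.length + held.length === total,
  };
}

// Thresholds copied from the success criteria above.
function passes(m) {
  return (
    m.exceptionRecall >= 0.95 &&
    m.falseAutoActionRate <= 0.02 &&
    m.autoActionRate >= 0.55 &&
    m.reconciles
  );
}

// Synthetic sample: 6 routine orders auto-actioned, 4 orders needing a human all held.
const sample = [
  ...Array.from({ length: 6 }, () => ({ needsHuman: false, disposition: "auto_action" })),
  ...Array.from({ length: 4 }, () => ({ needsHuman: true, disposition: "human_exception" })),
];
const metrics = computeMetrics(sample);
console.log(metrics, passes(metrics));
```

Classification and region-routing accuracy would be computed the same way, as matches against the answer key divided by total volume.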

Definition of done

A stranger can run node verify.mjs, reproduce every number, trace each claim to evidence, and see exactly which components are live vs. disclosed seams.