# Forge Property Management: Work Order Automation Impact Study
## ⚠ FICTIONAL / SYNTHETIC DEPLOYMENT MODEL
"Forge Property Management" is an invented enterprise customer. This study is
a synthetic deployment model, not a real customer deployment, and it makes
no claim of real production results. Every operational number comes from a
deterministic synthetic corpus of 100,000 work orders; every financial number
comes from stated illustrative assumptions (see assumptions.json). The
methodology is fully transparent and reproducible.
_Generated from verification-report.json on 2026-06-26T00:25:09.291Z. Simulation-tier label: FICTIONAL_DEPLOYMENT_MODEL_CERTIFIED._
## 1. Executive summary
A fictional enterprise property manager, Forge Property Management (42,000 units across Florida, Georgia, Texas, and North Carolina), currently routes **100,000 work orders a year** through a 14-person coordination team that reviews **every** order by hand (≈8 minutes each). This study simulates replacing that fully manual triage with the Work Order Agent Ecosystem and measures the result against a known answer key.
Across the full 100,000-order synthetic corpus, run through the unmodified ecosystem (the real PostgreSQL engine via in-memory PGlite, plus real gRPC dispatch on localhost):
- 56.6% of work orders were actioned automatically
(classified, routed, validated, and dispatched with no human touch).
- 27.3% were routed to humans as exceptions, and
16.1% were rejected back for missing information.
- The safety-critical false-auto-action rate was 0.00%
on the synthetic key, with 100.0% exception recall.
- 100.0% of duplicate resubmissions were suppressed,
and 100.0% of orders carried a complete, append-only audit trail.
- Under the stated assumptions, the model shows $369,392
in annual labor savings, an 8.69-month payback, and a 110.28% three-year ROI.
These are simulated figures intended to size the opportunity honestly — not a guarantee of real-world performance.
## 2. Baseline process (the fictional "before")
| Attribute | Value |
|---|---|
| Annual work orders | 100,000 |
| Review model | 100% manual — every order reviewed by a coordinator |
| Average handling time | 8 minutes/order |
| Coordination team | 14 people |
| Fully-loaded coordinator cost | $38/hour |
| Annual manual labor | 13,333.33 hours = $506,667 |
Stated pain points: slow routing, duplicate tickets, inconsistent vendor assignment, SLA misses, and poor auditability — exactly the failure modes the ecosystem's validator and audit layer target.
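
The baseline labor figure in the table follows directly from the three inputs above it. A minimal sketch of that arithmetic, with hypothetical names (`baseline`, `annualManualHours`, `annualManualCost`) that do not come from the study's code:

```javascript
// Baseline inputs transcribed from the table above.
const baseline = {
  annualOrders: 100_000,
  minutesPerOrder: 8,
  costPerHour: 38, // fully-loaded coordinator cost, USD
};

// Hours = orders x minutes per order / 60.
function annualManualHours({ annualOrders, minutesPerOrder }) {
  return (annualOrders * minutesPerOrder) / 60;
}

// Cost = hours x hourly rate.
function annualManualCost(inputs) {
  return annualManualHours(inputs) * inputs.costPerHour;
}

const baselineHours = annualManualHours(baseline);
const baselineCost = annualManualCost(baseline);

console.log(baselineHours.toFixed(2)); // 13333.33
console.log(Math.round(baselineCost)); // 506667
```
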
## 3. Agent architecture
The simulation runs the unmodified Work Order Agent Ecosystem (delivery-package/work-order-agents): four agents in sequence, over real infrastructure.
- Classifier — trade category + priority + field extraction with calibrated
confidence (deterministic lexicon engine; an LLM is a disclosed seam).
- Router — table-driven routing to queue, crew, vendor tier, region, and SLA.
- Validator — the safety boundary: required fields, confidence floors, cost
cap, and durable duplicate detection. Anything uncertain becomes a human exception.
- Actioner — auto-dispatch vs. human exception vs. reject, with idempotent
gRPC dispatch and an append-only audit entry for every order.
Infrastructure for this run: database engine pglite-memory, gRPC dispatch at 127.0.0.1:55092.
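
The four-stage flow described above can be illustrated with a small in-memory sketch. Everything in it is an assumption made for illustration: the lexicon, the confidence floor, the cost cap, the field names, and the use of an order ID for duplicate detection are hypothetical and are not taken from delivery-package/work-order-agents.

```javascript
// Hypothetical lexicon and thresholds; the study does not publish the real values.
const LEXICON = {
  plumbing: ["leak", "pipe", "faucet"],
  electrical: ["outlet", "breaker", "wiring"],
};
const CONFIDENCE_FLOOR = 0.6;
const COST_CAP = 500;

// Classifier: pick the trade with the most keyword hits; confidence is its share of all hits.
function classify(order) {
  const text = order.description.toLowerCase();
  const hits = Object.entries(LEXICON).map(([trade, words]) => ({
    trade,
    count: words.filter((word) => text.includes(word)).length,
  }));
  const total = hits.reduce((sum, hit) => sum + hit.count, 0);
  const best = hits.reduce((a, b) => (b.count > a.count ? b : a));
  return { trade: best.trade, confidence: total === 0 ? 0 : best.count / total };
}

// Router: table-driven routing reduced here to a single queue name.
function route(order, classification) {
  return { queue: `${order.region}-${classification.trade}` };
}

// Validator: the safety boundary. Anything uncertain is kept away from auto-dispatch.
function validate(order, classification, seenIds) {
  if (!order.unit || order.description.length < 10) {
    return { verdict: "reject", reason: "missing information" };
  }
  if (seenIds.has(order.id)) return { verdict: "exception", reason: "duplicate" };
  if (classification.confidence < CONFIDENCE_FLOOR) {
    return { verdict: "exception", reason: "low confidence" };
  }
  if (order.estimatedCost > COST_CAP) return { verdict: "exception", reason: "over cost cap" };
  return { verdict: "auto", reason: "passed all checks" };
}

// Actioner: run the pipeline and record one audit entry per order.
function action(order, seenIds, auditLog) {
  const classification = classify(order);
  const routing = route(order, classification);
  const result = validate(order, classification, seenIds);
  seenIds.add(order.id);
  auditLog.push({ orderId: order.id, action: result.verdict, reason: result.reason, queue: routing.queue });
  return result.verdict;
}

const seen = new Set();
const audit = [];
const order = { id: "wo-1", region: "FL", unit: "12B", description: "Kitchen faucet leak under sink", estimatedCost: 120 };
const ambiguous = { id: "wo-2", region: "GA", unit: "4A", description: "Outlet sparking and a pipe nearby", estimatedCost: 90 };

console.log(action(order, seen, audit)); // auto
console.log(action(order, seen, audit)); // exception (duplicate resubmission)
console.log(action(ambiguous, seen, audit)); // exception (two trades, confidence 0.5)
```

The design point the sketch tries to capture is that only the validator decides whether an order may be auto-dispatched, so tightening a threshold changes the exception rate without touching classification or routing.
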
## 4. Simulation methodology
- Inputs: 100,000 synthetic work orders generated by a seeded PRNG
(seed 20260625) with a ground-truth answer key. Corpus fingerprint: 6cbb9cc1149dd85f1c4324db….
- Execution: each order flows through classify → route → validate → action over
a real PostgreSQL engine and a real gRPC dispatch service.
- Scoring: accuracy, recall, precision, and the false-auto-action rate are
measured against the answer key; persistence and idempotency are checked against the live database and the gRPC wire.
- Determinism: the same seed reproduces the same corpus and therefore the same
metrics (asserted by a fingerprint check).
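
The determinism claim rests on one idea: a seeded generator produces the same corpus on every run, so a fingerprint of the corpus can be asserted. A minimal sketch of that idea follows. The study does not name its PRNG or its fingerprint algorithm, so the mulberry32-style generator, the uniform pattern sampling, and the 32-bit FNV-1a hash below are all stand-in assumptions, not the contents of src/enterprise-synth.mjs.

```javascript
// Small mulberry32-style PRNG: the same 32-bit seed always yields the same sequence.
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Pattern names come from the corpus-mix table; uniform sampling is a simplification.
const PATTERNS = ["clean", "emergency", "ambiguous", "highCost", "duplicate", "missingLoc", "missingField"];

function generateCorpus(seed, count) {
  const rand = mulberry32(seed);
  const orders = [];
  for (let i = 0; i < count; i++) {
    const pattern = PATTERNS[Math.floor(rand() * PATTERNS.length)];
    orders.push({ id: `wo-${i}`, pattern });
  }
  return orders;
}

// FNV-1a (32-bit) over the serialized corpus, standing in for a real cryptographic digest.
function fingerprint(orders) {
  const text = JSON.stringify(orders);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

const firstRun = fingerprint(generateCorpus(20260625, 1000));
const secondRun = fingerprint(generateCorpus(20260625, 1000));
console.log(firstRun === secondRun); // true: same seed, same corpus, same fingerprint
```
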
The corpus mix (operational realities modeled):
| Pattern | Count | Models |
|---|---|---|
| clean | 49,865 | clear single-trade, valid → auto-dispatch |
| emergency | 7,914 | P1 safety case → escalated auto-dispatch |
| ambiguous | 8,194 | two trades, weak signal → human exception |
| highCost | 7,961 | over the auto-approval cap → human exception |
| duplicate | 10,009 | resubmission → suppressed, human exception |
| missingLoc | 9,058 | no resolvable unit/zone → rejected |
| missingField | 6,999 | description too short → rejected |
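
As a quick consistency check, the counts in the table can be totaled by intended disposition. The sketch below reads the "Models" column as the answer key's intended outcome for each pattern; that reading is an interpretation, and the names (`corpusMix`, `tallyByDisposition`) are hypothetical.

```javascript
// Counts and intended dispositions transcribed from the corpus-mix table.
const corpusMix = [
  { pattern: "clean", count: 49_865, expected: "auto" },
  { pattern: "emergency", count: 7_914, expected: "auto" },
  { pattern: "ambiguous", count: 8_194, expected: "exception" },
  { pattern: "highCost", count: 7_961, expected: "exception" },
  { pattern: "duplicate", count: 10_009, expected: "exception" },
  { pattern: "missingLoc", count: 9_058, expected: "reject" },
  { pattern: "missingField", count: 6_999, expected: "reject" },
];

// Sum the counts for each intended disposition.
function tallyByDisposition(mix) {
  const totals = { auto: 0, exception: 0, reject: 0 };
  for (const { count, expected } of mix) totals[expected] += count;
  return totals;
}

const corpusSize = corpusMix.reduce((sum, row) => sum + row.count, 0);
const expectedTotals = tallyByDisposition(corpusMix);

console.log(corpusSize); // 100000
console.log(expectedTotals); // { auto: 57779, exception: 26164, reject: 16057 }
```

Under this reading, the intended rejection total (16,057) equals the rejected count reported in section 6, while the intended auto-dispatch total (57,779) is 1,128 higher than the 56,651 orders actually auto-dispatched. Human exceptions are higher by the same 1,128, which is the direction in which a conservative validator would be expected to err.
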
## 5. Synthetic data disclosure
All data is machine-generated. No real customer, property, tenant, vendor, cost, or work order is represented. The full corpus is written to data/enterprise-work-orders.jsonl (90.4 MB, sha256 57504af7ffbbc0a9a22af2a2…); a 1,000-row sample is shipped in datasets/sample-1000.jsonl with a schema in datasets/dataset-card.md. Reported accuracy is against the synthetic answer key — real tenant text is messier, so absolute accuracy in a real deployment would differ.
## 6. Operational results
| Metric | Result |
|---|---|
| Work orders processed | 100,000 |
| Classification accuracy | 100.0% |
| Priority accuracy | 100.0% |
| Routing accuracy (region) | 100.0% |
| Auto-action rate | 56.6% |
| Human exception rate | 27.3% |
| Rejection rate | 16.1% |
| Needs-review rate | 27.3% |
| False-auto-action rate | 0.00% |
| Exception precision / recall / F1 | 0.974 / 1 / 0.9868 |
| Duplicate suppression | 100.0% |
| SLA routing performance | 100.0% |
| Emergency escalation | 100.0% |
| Audit completeness | 100.0% |
| Avg processing time | 3.7161 ms/order (269/s) |
Dispositions: 56,651 auto-dispatched, 27,292 human exceptions, 16,057 rejected. Persistence (real DB rows): work_orders 100,000, audit_log 100,000, dispatch_records 56,651.
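
Several of the figures above can be cross-checked against each other. The following sketch uses only numbers reported in this section; the object shape and function names are hypothetical. The F1 formula is the standard harmonic mean of precision and recall.

```javascript
// Figures transcribed from the results table and the dispositions line.
const run = {
  processed: 100_000,
  autoDispatched: 56_651,
  humanExceptions: 27_292,
  rejected: 16_057,
  rows: { workOrders: 100_000, auditLog: 100_000, dispatchRecords: 56_651 },
  exceptionPrecision: 0.974,
  exceptionRecall: 1,
  avgMsPerOrder: 3.7161,
};

// F1 is the harmonic mean of precision and recall.
function f1Score(precision, recall) {
  return (2 * precision * recall) / (precision + recall);
}

// Internal-consistency checks between dispositions and persisted rows.
function checkRun(r) {
  return {
    dispositionsAddUp: r.autoDispatched + r.humanExceptions + r.rejected === r.processed,
    everyOrderPersisted: r.rows.workOrders === r.processed,
    everyOrderAudited: r.rows.auditLog === r.processed,
    dispatchRowsMatchAuto: r.rows.dispatchRecords === r.autoDispatched,
  };
}

const checks = checkRun(run);
const f1 = f1Score(run.exceptionPrecision, run.exceptionRecall);
const ordersPerSecond = Math.round(1000 / run.avgMsPerOrder);

console.log(checks); // all four checks are true
console.log(f1.toFixed(4)); // 0.9868
console.log(ordersPerSecond); // 269
```
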
## 7. Financial impact (illustrative ROI model)
All inputs are stated assumptions (assumptions.json); the only measured input is the auto-action rate above. Every line shows its arithmetic in evidence/roi.json.
| Line | Value |
|---|---|
| Manual baseline labor | 13,333.33 h → $506,667/yr |
| Exceptions still needing a human | 43,350 orders × 5 min |
| Agent-assisted exception labor | 3,612.5 h → $137,275/yr |
| Annual labor savings | $369,392 |
| Coordinator hours recovered | 9,720.83 h (4.67 FTE) |
| Coordinator capacity recovered | 12.26 of 14 FTE |
| Implementation (one-time) | $185,000 |
| Platform (annual) | $114,000 |
| First-year net savings | $70,392 |
| Payback period | 8.69 months |
| 3-year gross savings | $1,108,175 |
| 3-year total cost | $527,000 |
| 3-year net savings | $581,175 |
| 3-year ROI | 110.28% |
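
The table's arithmetic can be reproduced from its own inputs. The sketch below is one reconstruction, not the contents of evidence/roi.json: the function and field names are hypothetical, and the payback formula (implementation cost divided by annual savings net of the platform fee, expressed in months) is an assumption chosen because it reproduces the reported 8.69 months.

```javascript
// Inputs transcribed from sections 2 and 7.
const assumptions = {
  annualOrders: 100_000,
  manualMinutesPerOrder: 8,
  exceptionOrders: 43_350,
  exceptionMinutesPerOrder: 5,
  costPerHour: 38,
  implementationCost: 185_000, // one-time
  platformCostPerYear: 114_000,
  years: 3,
};

function roiModel(a) {
  const baselineHours = (a.annualOrders * a.manualMinutesPerOrder) / 60;
  const exceptionHours = (a.exceptionOrders * a.exceptionMinutesPerOrder) / 60;
  const annualSavings = (baselineHours - exceptionHours) * a.costPerHour;
  const firstYearNet = annualSavings - a.implementationCost - a.platformCostPerYear;
  // Assumed payback definition: one-time cost recovered from savings net of the platform fee.
  const paybackMonths = (a.implementationCost / (annualSavings - a.platformCostPerYear)) * 12;
  const grossSavings = annualSavings * a.years;
  const totalCost = a.implementationCost + a.platformCostPerYear * a.years;
  const netSavings = grossSavings - totalCost;
  const roiPercent = (netSavings / totalCost) * 100;
  return { exceptionHours, annualSavings, firstYearNet, paybackMonths, grossSavings, totalCost, netSavings, roiPercent };
}

const roi = roiModel(assumptions);
console.log(roi.exceptionHours); // 3612.5
console.log(Math.round(roi.annualSavings)); // 369392
console.log(Math.round(roi.firstYearNet)); // 70392
console.log(roi.paybackMonths.toFixed(2)); // 8.69
console.log(Math.round(roi.netSavings)); // 581175
console.log(roi.roiPercent.toFixed(2)); // 110.28
```

Because every output is a pure function of the inputs, changing a single assumption (for example the exception review time) shows immediately how sensitive the payback period is to it.
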
## 8. Risk controls
- Conservative validator. Low-confidence, over-cost, and duplicate orders go to a human,
and orders with missing information are rejected back; none are auto-actioned. The simulated false-auto-action rate is 0.00%.
- Human-in-the-loop for exceptions. 43.4% of volume (27.3% human exceptions plus
16.1% rejections) is retained for human handling by design.
- Idempotent dispatch. Retries never double-dispatch (verified on the gRPC wire).
- Malformed-payload rejection. The dispatch service rejects contract violations.
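- Determinism. Identical inputs produce identical classification, routing, and validation decisions. The one disclosed exception is the transport-timing-dependent orphan count described under Limitations (at-least-once dispatch).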
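
Two of the controls above, idempotent dispatch and malformed-payload rejection, can be sketched with an in-memory stand-in for the dispatch service. This is not the gRPC service from the study: the class name, the use of `orderId` as the idempotency key, and the `DSP-` reference format are hypothetical.

```javascript
// In-memory stand-in for a dispatch service keyed by order ID.
class DispatchService {
  constructor() {
    this.records = new Map();
  }

  dispatch(request) {
    // Contract check: reject payloads that lack a usable order ID.
    if (typeof request?.orderId !== "string" || request.orderId === "") {
      throw new TypeError("malformed dispatch request: orderId is required");
    }
    // Idempotency: a retry returns the stored record instead of creating a second one.
    const existing = this.records.get(request.orderId);
    if (existing) return { ...existing, replayed: true };

    // Deterministic reference derived from the order ID (assumed format).
    const record = { orderId: request.orderId, reference: `DSP-${request.orderId}` };
    this.records.set(request.orderId, record);
    return { ...record, replayed: false };
  }
}

const service = new DispatchService();
const firstAttempt = service.dispatch({ orderId: "wo-42" });
const retry = service.dispatch({ orderId: "wo-42" });

console.log(firstAttempt.reference === retry.reference); // true
console.log(retry.replayed); // true
console.log(service.records.size); // 1
```
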
## 9. Auditability
Every one of the 100,000 orders produces an append-only audit_log row (100.0% completeness) capturing the action, reason, and decision detail, plus a durable work_orders record and, for auto-dispatches, a dispatch_records row with a deterministic reference. A sample audit trail (auto-dispatched + exception) is exported to evidence/audit-trace-sample.json.
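
To make "append-only" and "completeness" concrete, here is a minimal in-memory sketch. The real audit_log is a database table; the class below, its field names, and its `completeness` helper are hypothetical and only mirror the action, reason, and detail fields named above.

```javascript
// Append-only log: entries can be added and read, but not replaced or removed through this API.
class AuditLog {
  #entries = [];

  append(entry) {
    // Shallow freeze so the stored entry's top-level fields cannot be reassigned.
    const stored = Object.freeze({ seq: this.#entries.length + 1, ...entry });
    this.#entries.push(stored);
    return stored;
  }

  // Return a copy so callers cannot mutate the internal array.
  entries() {
    return [...this.#entries];
  }

  // Fraction of the given order IDs that have at least one audit entry.
  completeness(orderIds) {
    if (orderIds.length === 0) return 1;
    const audited = new Set(this.#entries.map((entry) => entry.orderId));
    return orderIds.filter((id) => audited.has(id)).length / orderIds.length;
  }
}

const auditLog = new AuditLog();
auditLog.append({ orderId: "wo-1", action: "auto", reason: "passed all checks", detail: { queue: "FL-plumbing" } });
auditLog.append({ orderId: "wo-2", action: "exception", reason: "over cost cap", detail: { estimatedCost: 2400 } });

console.log(auditLog.completeness(["wo-1", "wo-2"])); // 1
console.log(auditLog.entries().length); // 2
```
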
## 10. Limitations
This is a synthetic model. Key disclosed seams:
- FICTIONAL CUSTOMER: "Forge Property Management" is an invented enterprise. No real customer relationship, deployment, or contract exists. This is a synthetic deployment model, not a production case study.
- SYNTHETIC DATA: All 100,000 work orders are generated by a seeded PRNG (src/enterprise-synth.mjs) with a ground-truth answer key. Reported accuracy is against that synthetic key, not real tenant text; absolute accuracy on real intake will differ.
- SIMULATED OPERATIONS & ROI: Manual handling time, exception review time, coordinator cost, implementation cost, and platform cost are stated illustrative assumptions (assumptions.json), not measured production figures. The ROI is a transparent model, not a realized financial result.
- LIVE INFRASTRUCTURE (in-process): Persistence is the real PostgreSQL engine via PGlite (in-memory for this run) and dispatch crosses a real gRPC/HTTP2 wire to a Node service on localhost. An external Postgres (DATABASE_URL) and a Go gRPC service are wire-compatible disclosed seams, not exercised here.
- LEXICON CLASSIFIER: The classifier is a deterministic lexicon model, not a hosted LLM. The production design swaps an LLM behind the same interface; that swap is unverified here.
- AT-LEAST-ONCE DISPATCH: Under load a Dispatch RPC can occasionally exceed its client deadline after the gRPC server has already committed the dispatch row. The actioner escalates those orders to a human exception (never double-dispatches, never silently drops), so dispatch_records can slightly exceed the auto-dispatched count. The exact orphan count is transport-timing dependent and not bit-identical across runs; the reconciliation identity (dispatch_records = auto-dispatched + safely-escalated orphans) holds every run.
See proof/LIMITATIONS.md for the full list.
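
The reconciliation identity from the at-least-once dispatch item can be written as a one-line check. The function name and result shape below are hypothetical. The study does not state this run's orphan count; zero is inferred here only because the reported dispatch_records and auto-dispatched totals are equal.

```javascript
// Identity: dispatch_records = auto-dispatched + safely-escalated orphans.
function reconcile({ dispatchRecords, autoDispatched, escalatedOrphans }) {
  const expected = autoDispatched + escalatedOrphans;
  return { holds: dispatchRecords === expected, expected, difference: dispatchRecords - expected };
}

// This run, using the totals reported in section 6 (orphan count inferred as 0).
const thisRun = reconcile({ dispatchRecords: 56_651, autoDispatched: 56_651, escalatedOrphans: 0 });

// A made-up run in which three dispatches committed after the client deadline expired.
const hypotheticalRun = reconcile({ dispatchRecords: 56_651, autoDispatched: 56_648, escalatedOrphans: 3 });

console.log(thisRun.holds); // true
console.log(hypotheticalRun.holds); // true
```

A non-zero `difference` would indicate a dispatch row that is neither an auto-dispatch nor an escalated orphan, which is the case the identity is meant to rule out.
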
## 11. Production-readiness roadmap
- Integrate real work-order intake (portal/email/phone/IoT) in place of the
synthetic generator.
- Move to an external managed PostgreSQL (DATABASE_URL) and a deployed gRPC
dispatch service (Go implementation of proto/dispatch.proto).
- Connect real vendor/dispatch systems and the customer's SLA policy.
- Optionally swap the lexicon classifier for an LLM behind the same interface and
re-verify accuracy on the customer's real text.
- Add enterprise non-functional controls (identity/SSO, RBAC, tenant isolation,
audit retention, security/compliance review).
- Run a shadow period against real volume, then a limited live pilot, before any
claim of real production performance.
## 12. Recommended rollout plan
| Phase | Duration (illustrative) | Scope | Exit criteria |
|---|---|---|---|
| 0 · Shadow | 4–6 weeks | Agents score real orders; humans still action everything | Accuracy + false-auto-action measured on real text |
| 1 · Assist | 4–8 weeks | Agents recommend routing; humans approve | Coordinator time/order drops; exception quality holds |
| 2 · Auto (low-risk) | 8–12 weeks | Auto-dispatch only high-confidence, in-cost, single-trade orders | False-auto-action stays within target on real data |
| 3 · Scale | ongoing | Expand auto-action coverage; humans focus on exceptions | Stable safety + audit metrics; realized savings tracked |
Each phase is gated on real measured safety metrics — not on this simulation.
_This document is generated from the verified run. Re-run node verify.mjs then node build-deliverables.mjs to reproduce it. FICTIONAL / SYNTHETIC DEPLOYMENT MODEL — no real customer data was used._