Agent Memory Manager

Executive Evidence — agent-memory-manager

Standard: IRS_AUDITOR. This document answers the ten required questions with objective evidence. Every number traces to verification-report.json, produced by node verify.mjs and captured in proof/evidence/verify.log.

1. What exactly is being claimed?

A Node.js/TypeScript module that provides hierarchical (hot/warm/cold) agent memory with:

- the four API methods store, retrieve, summarizeThread, getContextForTask (class MemoryManager), plus MemoryTier and AgentMemoryWrapper;
- vector storage/retrieval on Postgres + pgvector (exercised for real via the in-process PGlite engine; wire-compatible with an external Postgres server through pg);
- time-based decay, relevance scoring (cosine similarity + heuristics), and eviction (LRU + importance);
- optional fleet sync via a pluggable event bus;
- structured logs and metrics.

No claim is made of semantic-quality retrieval, hosted-LLM summarization quality, or distributed-broker behaviour; those are disclosed seams (§6).
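To make the decay, relevance-scoring, and eviction claims above concrete, here is a minimal sketch of how the three pieces could fit together. It is not taken from the package source: the `MemoryRecord` shape, the one-hour half-life, the 0.7/0.3 weights, and the function names are all assumptions chosen for illustration.

```typescript
// Sketch only: field names, weights, and the half-life are assumptions,
// not values taken from the agent-memory-manager source.
interface MemoryRecord {
  id: string;
  embedding: number[];
  importance: number;     // 0..1, assigned when the memory is stored
  lastAccessedMs: number; // epoch milliseconds of the last access
}

/** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

/** Exponential time decay: 1 at age 0, 0.5 after one half-life. */
function decay(ageMs: number, halfLifeMs: number): number {
  return Math.pow(0.5, ageMs / halfLifeMs);
}

/** Relevance: similarity blended with decayed importance (hypothetical weights). */
function relevance(query: number[], m: MemoryRecord, nowMs: number, halfLifeMs = 3_600_000): number {
  const recency = decay(nowMs - m.lastAccessedMs, halfLifeMs);
  return 0.7 * cosine(query, m.embedding) + 0.3 * m.importance * recency;
}

/** Eviction victim: the record with the lowest importance-weighted recency. */
function pickEvictionVictim(tier: MemoryRecord[], nowMs: number, halfLifeMs = 3_600_000): MemoryRecord | undefined {
  let victim: MemoryRecord | undefined;
  let worst = Infinity;
  for (const m of tier) {
    const keepScore = m.importance * decay(nowMs - m.lastAccessedMs, halfLifeMs);
    if (keepScore < worst) {
      worst = keepScore;
      victim = m;
    }
  }
  return victim;
}
```

Multiplying importance by recency is only one way to combine LRU with importance; the package may weight or order the two signals differently.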

2. What evidence supports each claim?

verification-report.json records 20/20 checks PASS, including: the full node:test unit suite (24 tests); synthetic retrieval precision@1 = 1.0 and precision@5 ≈ 0.98 on a seeded answer key over both the in-memory store and real pgvector; pgvector-vs-JS cosine parity within 1e-5; persistence row counts; decay-driven demotion; tier-capacity eviction with an emitted metric; fleet replication of fleet-scoped (and non-replication of local-scoped) memories; and metric/histogram recording. Claim-to-evidence lineage is in proof/CLAIM_EVIDENCE.json; raw output in proof/evidence/verify.log.
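For readers unfamiliar with the two numeric checks named above, the following sketch shows how precision@k and a cosine parity tolerance are conventionally computed. The function names and input shapes are hypothetical; verify.mjs may structure its checks differently.

```typescript
// Sketch only: names and input shapes are illustrative assumptions.

/** precision@k for one query: share of the top-k result ids found in the answer key. */
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const topK = ranked.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / k;
}

/** Mean precision@k over a set of queries. */
function meanPrecisionAtK(
  queries: { ranked: string[]; relevant: Set<string> }[],
  k: number,
): number {
  if (queries.length === 0) return 0;
  const total = queries.reduce((sum, q) => sum + precisionAtK(q.ranked, q.relevant, k), 0);
  return total / queries.length;
}

/** Parity check: two similarity values agree within an absolute tolerance. */
function withinTolerance(a: number, b: number, tol = 1e-5): boolean {
  return Math.abs(a - b) <= tol;
}
```

A parity check of this kind would compare the similarity computed in JavaScript with the value returned by the pgvector query for the same pair of vectors.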

3. Can an independent engineer reproduce this claim?

Yes. proof/REPRODUCE.md lists exact commands. The corpus is generated by a seeded PRNG; verify.mjs re-derives it and checks the fingerprint is identical across runs. Checksums for every shipped file are in proof/CHECKSUMS.json (self-verify with `node tools/forge-proof-verify.mjs --outcome delivery-package/agent-memory-manager`).
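The reproducibility argument rests on a seeded generator producing the same corpus, and therefore the same fingerprint, on every run. A minimal sketch of that idea follows. The generator (a mulberry32-style PRNG), the vocabulary, and the use of SHA-256 over the JSON-serialized corpus are assumptions for illustration; the package's actual generator and fingerprint format are not specified here.

```typescript
import { createHash } from "node:crypto";

// Sketch only: the PRNG, vocabulary, and fingerprint format are assumptions.

/** Small deterministic 32-bit PRNG in the mulberry32 style; returns values in [0, 1). */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const VOCAB = ["agent", "memory", "vector", "thread", "task", "decay", "tier", "sync"];

/** Build a synthetic corpus whose content depends only on the seed. */
function generateCorpus(seed: number, docs: number, wordsPerDoc = 6): string[] {
  const rand = mulberry32(seed);
  const corpus: string[] = [];
  for (let d = 0; d < docs; d++) {
    const words: string[] = [];
    for (let w = 0; w < wordsPerDoc; w++) {
      words.push(VOCAB[Math.floor(rand() * VOCAB.length)]);
    }
    corpus.push(words.join(" "));
  }
  return corpus;
}

/** SHA-256 hex digest of the serialized corpus. */
function fingerprint(corpus: string[]): string {
  return createHash("sha256").update(JSON.stringify(corpus)).digest("hex");
}
```

Two runs with the same seed should yield byte-identical corpora and matching fingerprints; a mismatch would indicate nondeterminism somewhere in generation.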

4. What assumptions were made?

That cosine similarity over the default lexical hashing embeddings is an acceptable retrieval signal for the synthetic benchmark (it tracks token overlap); that PGlite's pgvector is representative of server-side pgvector SQL (the SQL is identical); and that an in-process EventEmitter is a faithful stand-in for the SyncBus contract a real broker would implement.
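The first assumption is easier to evaluate with a picture of what a lexical hashing embedding does. The sketch below uses the generic hashing trick: each token is hashed to a bucket and the bucket counts are L2-normalized. The tokenizer, the FNV-1a hash, and the 256-dimension default are assumptions and may differ from the package's default provider.

```typescript
// Sketch only: tokenizer, hash function, and dimension are assumptions.

/** FNV-1a 32-bit hash, used here only to pick a bucket for a token. */
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

/** Hashing-trick embedding: count tokens per bucket, then L2-normalize. */
function hashEmbed(text: string, dim = 256): number[] {
  const v = new Array<number>(dim).fill(0);
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  for (const tok of tokens) {
    v[fnv1a(tok) % dim] += 1;
  }
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v : v.map((x) => x / norm);
}

/** Dot product; equals cosine similarity when both inputs are unit vectors. */
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}
```

Because each bucket holds token counts, two texts score highly when they share tokens regardless of meaning, which is why the similarity tracks token overlap rather than semantics.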

5. What limitations exist?

See proof/LIMITATIONS.md. In short: embeddings are lexical, not learned-semantic; the default summarizer is extractive, not abstractive; Postgres runs in-process (PGlite) in this run; fleet sync is in-process; the retrieval benchmark is synthetic.

6. What seams exist?

DISCLOSED_SEAMs: hosted semantic embeddings (RemoteEmbeddingProvider), hosted LLM summarization (ClaudeSummarizer), an external Postgres server (pg), and a distributed sync broker are all provided behind interfaces but are not executed by the verification suite.
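As an illustration of what "provided behind interfaces" means for the sync seam, the sketch below defines a small bus contract and an in-process implementation built on Node's EventEmitter. The method names, the `MemoryEvent` shape, and the `replicate` helper are hypothetical; the package's actual SyncBus contract may look different.

```typescript
import { EventEmitter } from "node:events";

// Sketch only: this contract and the event shape are assumptions,
// not the package's actual SyncBus definition.
type Scope = "fleet" | "local";

interface MemoryEvent {
  id: string;
  scope: Scope;
  content: string;
}

interface SyncBus {
  publish(event: MemoryEvent): void;
  subscribe(handler: (event: MemoryEvent) => void): void;
}

/** In-process stand-in: delivers events synchronously to local subscribers. */
class InProcessSyncBus implements SyncBus {
  private readonly emitter = new EventEmitter();

  publish(event: MemoryEvent): void {
    this.emitter.emit("memory", event);
  }

  subscribe(handler: (event: MemoryEvent) => void): void {
    this.emitter.on("memory", handler);
  }
}

/** Replicate only fleet-scoped memories; local-scoped ones never reach the bus. */
function replicate(bus: SyncBus, event: MemoryEvent): boolean {
  if (event.scope !== "fleet") return false;
  bus.publish(event);
  return true;
}
```

A broker-backed implementation would satisfy the same interface, but network delivery brings ordering, retry, and partition behaviour that an in-process emitter cannot exhibit, which is why that path counts as a seam rather than as verified behaviour.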

7. What was actually executed?

tsc build; node --test (24 unit tests); node verify.mjs, which runs the API over the in-memory store and a live PGlite Postgres+pgvector engine, the synthetic precision benchmark, decay/eviction, fleet sync, and metrics capture. proof/EXECUTION_TRACE.json records the command, exit code, and stdout sha256.
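A trace entry of the kind described above (command, exit code, stdout hash) could be produced along these lines. The field names and helper functions are assumptions; the actual schema of proof/EXECUTION_TRACE.json is not reproduced here.

```typescript
import { spawnSync } from "node:child_process";
import { createHash } from "node:crypto";

// Sketch only: field names are assumptions, not the EXECUTION_TRACE.json schema.
interface ExecutionTrace {
  command: string;
  exitCode: number | null; // null if the process was killed by a signal
  stdoutSha256: string;
}

/** Build a trace entry from an already-captured run. */
function buildTrace(command: string, exitCode: number | null, stdout: string): ExecutionTrace {
  return {
    command,
    exitCode,
    stdoutSha256: createHash("sha256").update(stdout).digest("hex"),
  };
}

/** Run a command synchronously and record its trace. */
function traceCommand(cmd: string, args: string[]): ExecutionTrace {
  const result = spawnSync(cmd, args, { encoding: "utf8" });
  return buildTrace([cmd, ...args].join(" "), result.status, result.stdout ?? "");
}
```

Hashing stdout rather than storing it keeps the trace small while still letting a reviewer confirm that a captured log matches the run that produced it.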

8. What was inferred (not directly executed)?

Behaviour against an external Postgres server, against hosted embedding/LLM APIs, and over a network sync broker is *inferred* from the shared interfaces and identical SQL; none of it was executed here.

9. What remains unverified?

Retrieval quality on real (non-synthetic) agent traffic; hosted-LLM summary quality; throughput/latency at fleet scale; durability across an external DB restart; behaviour of an external sync broker. These gaps keep this outcome below official-benchmark or PRODUCTION_VALIDATED status.

10. What evidence would invalidate these claims?

Any failing check in verification-report.json; a checksum mismatch; a pgvector-vs-JS cosine divergence beyond 1e-5; a corpus-fingerprint mismatch across runs; or a claim in this package with no corresponding source in proof/CLAIM_EVIDENCE.json.
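Two of the invalidation conditions above (a failing check, and a claim with no evidence entry) lend themselves to a mechanical scan. The sketch below assumes a simple shape for both files: a list of named checks with a PASS/FAIL status, and a map from claim id to evidence sources. Neither shape is confirmed by this document; treat them as placeholders for whatever verification-report.json and proof/CLAIM_EVIDENCE.json actually contain.

```typescript
// Sketch only: both data shapes are assumptions about files whose
// schemas are not reproduced in this document.
interface Check {
  name: string;
  status: "PASS" | "FAIL";
}

interface VerificationReport {
  checks: Check[];
}

/** Return a human-readable list of conditions that would invalidate the claims. */
function findInvalidations(
  report: VerificationReport,
  claimIds: string[],
  evidence: Record<string, string[]>,
): string[] {
  const problems: string[] = [];
  for (const check of report.checks) {
    if (check.status !== "PASS") problems.push(`failing check: ${check.name}`);
  }
  for (const id of claimIds) {
    const sources = evidence[id];
    if (!sources || sources.length === 0) problems.push(`claim without evidence: ${id}`);
  }
  return problems;
}
```

An empty result means only that these two conditions were not triggered; checksum, cosine-parity, and corpus-fingerprint mismatches would need their own checks.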