Agent Memory Manager
Verification Report — Hierarchical Agent Memory Manager
Strictness: IRS_AUDITOR | Proof status: public API + policies verified over the in-memory store AND a real Postgres+pgvector engine (PGlite); inbound benchmark corpus is synthetic/seeded
Checks: PASS 30 / 30 (100%) | Generated: 2026-06-26T14:11:10.274Z
Infrastructure: database engine PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.74 (1092ec30a3fb1d46b1782ff1b4db5094d3d06ae5), 32-bit; pgvector 0.8.0; embeddings: hashing-local; summarizer: extractive-local.
Disclosed seams & limitations
- SYNTHETIC INPUT: the retrieval benchmark corpus is generated by a seeded PRNG with a known topic answer key (verify.mjs). Reported precision is against that synthetic key, NOT real production text; absolute accuracy on real agent traffic will differ. This is the blocking gap for an official benchmark / PRODUCTION_VALIDATED.
- DISCLOSED_SEAM: the default embeddings are a local, deterministic hashing model (lexical, not learned-semantic). Cosine similarity tracks token overlap. A hosted semantic embedding model can be plugged in via RemoteEmbeddingProvider but is NOT exercised here.
- DISCLOSED_SEAM: the default summarizer is a local extractive (frequency-based) summarizer. ClaudeSummarizer (hosted LLM) is provided behind the same interface but requires a network + ANTHROPIC_API_KEY and is NOT exercised here.
- LIVE INFRASTRUCTURE (in-process): Postgres + pgvector run in-process via PGlite (WASM). The identical SQL/pgvector code path can run against an external Postgres server through node-postgres (pg); that path is a wire-compatible disclosed seam and is NOT exercised in this run.
- DISCLOSED_SEAM: fleet sync is exercised over an in-process EventEmitter bus. A distributed broker (Redis/NATS/Kafka) implementing the same SyncBus interface is a disclosed seam, not exercised here.
- DISCLOSED_SEAM: the cross-encoder reranker (LocalCrossEncoderReranker) requires the optional @xenova/transformers dependency + a one-time model download. verify.mjs exercises the reranker INTERFACE with a deterministic fake; the real MiniLM cross-encoder is measured separately in the official BEIR/SciFact benchmark (bench/beir-scifact.mjs), whose results are recorded in officialBenchmark.
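To make the hashing-embedding seam above concrete, here is a minimal sketch of a local, deterministic hashing embedding. It assumes a 64-dimension vector, FNV-1a token hashing, and a lowercase alphanumeric tokenizer. The names `DIM`, `fnv1a`, `tokenize`, `hashEmbed`, and `cosine` are hypothetical and not taken from the project, and the real provider may differ in every detail. The sketch only illustrates why cosine similarity over such vectors tracks token overlap rather than learned meaning.

```javascript
// Illustrative only: a deterministic, lexical hashing embedding.
// DIM, fnv1a, tokenize, hashEmbed and cosine are hypothetical names.
const DIM = 64; // assumed dimensionality

// FNV-1a over UTF-16 code units, returned as an unsigned 32-bit integer.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Lowercase and split into alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Each token adds +1 or -1 to one bucket; the result is L2-normalized.
function hashEmbed(text, dim = DIM) {
  const vec = new Array(dim).fill(0);
  for (const token of tokenize(text)) {
    const h = fnv1a(token);
    const sign = (h >>> 31) === 1 ? -1 : 1; // top bit picks the sign
    vec[h % dim] += sign;                    // remainder picks the bucket
  }
  const norm = Math.hypot(...vec);
  return norm === 0 ? vec : vec.map((x) => x / norm);
}

// Cosine similarity; returns 0 when either vector is all zeros.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

const query = hashEmbed("rotate the postgres credentials");
const near = hashEmbed("postgres credentials were rotated last week");
const far = hashEmbed("lunch menu for friday");
console.log(cosine(query, near), cosine(query, far));
```

Shared tokens land in the same buckets, so the first score would typically be the higher one, although hash collisions can add noise. A paraphrase with no shared tokens scores near zero, which is the limitation the seam discloses.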
Synthetic retrieval benchmark
| Metric | Value |
|---|---|
| Memories / topics | 144 / 12 (seed 20260625) |
| In-memory precision@1 / @5 | 1 / 1 |
| pgvector precision@1 / @5 | 1 / 1 |
| Store top-1 agreement (in-mem vs pgvector) | 1 |
| pgvector vs JS cosine abs error | 2e-8 |
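For readers reproducing the table above, the following sketch shows one common way to compute precision@k against a topic answer key: the fraction of the top k results whose topic matches the query's topic, averaged over queries. Whether verify.mjs uses exactly this definition is an assumption, and `precisionAtK`, `meanPrecisionAtK`, and the sample `runs` data are hypothetical.

```javascript
// Hypothetical helpers: precision@k against a known topic answer key.
// `ranked` holds the topics of the retrieved memories, best match first.
function precisionAtK(ranked, expectedTopic, k) {
  if (k <= 0) return 0;
  const hits = ranked.slice(0, k).filter((t) => t === expectedTopic).length;
  return hits / k;
}

// Average precision@k over all queries.
function meanPrecisionAtK(runs, k) {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, r) => sum + precisionAtK(r.ranked, r.expected, k),
    0,
  );
  return total / runs.length;
}

// Made-up sample data, not taken from the report.
const runs = [
  { expected: "topic-0", ranked: ["topic-0", "topic-0", "topic-3", "topic-0", "topic-0"] },
  { expected: "topic-1", ranked: ["topic-1", "topic-1", "topic-1", "topic-1", "topic-1"] },
];
console.log(meanPrecisionAtK(runs, 1)); // 1
console.log(meanPrecisionAtK(runs, 5)); // approximately 0.9
```

Under this definition, a precision@5 of 1 requires at least five memories per topic. With 144 memories over 12 topics that holds if the corpus is split evenly (12 per topic), which is itself an assumption about the generator.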
Checks
| Check | Detail | Result |
|---|---|---|
| Unit suite: node:test passes (policies, manager API, tiering, sync, pgvector) | tests passed=32 fail=0 | PASS |
| In-memory retrieval precision@1 == 1.0 on the synthetic answer key | p@1=1 | PASS |
| In-memory retrieval precision@5 >= 0.95 | p@5=1 | PASS |
| Deterministic retrieval: identical query returns identical ordering | mem-0-4,mem-0-8,mem-0-7,mem-0-9,mem-0-10 == mem-0-4,mem-0-8,mem-0-7,mem-0-9,mem-0-10 | PASS |
| BM25 sparse index ranks an exact rare-token match first | top=rare | PASS |
| Reciprocal Rank Fusion ranks the multiply-agreed item first | top=x | PASS |
| Hybrid retrieval (dense+BM25 RRF) surfaces the exact-token memory first | top=rare | PASS |
| Reranker reorders candidates to surface the cross-encoder-preferred memory | top=target | PASS |
| pgvector retrieval precision@1 == 1.0 on the synthetic answer key | p@1=1 | PASS |
| pgvector retrieval precision@5 >= 0.95 | p@5=1 | PASS |
| pgvector cosine distance agrees with JS cosine similarity (abs error <= 1e-5) | pg=0.48795 js=0.48795 | PASS |
| pgvector persistence: every memory persisted to Postgres | rows=144 == 144 | PASS |
| Store parity: in-memory and pgvector agree on the top-1 topic for every query | 12/12 | PASS |
| summarizeThread produces a non-empty summary and persists a summary memory | len=200, persisted=true, type=summary | PASS |
| summarizeThread: summary is retrievable from the store | found=true | PASS |
| getContextForTask respects the character budget | chars=226 <= 300 (+1 line) | PASS |
| getContextForTask prioritizes the summary memory | firstType=summary | PASS |
| Time decay demotes a hot memory below the hot importance floor | hot -> warm | PASS |
| Hot tier capacity enforced (overflow demoted) | hot=4 <= 4 | PASS |
| Cold tier eviction caps size and emits eviction metric | cold=3 <= 3, evicted=5 | PASS |
| Fleet sync: fleet-scoped memory replicates to a peer node | replicated=true | PASS |
| Fleet sync: local-scoped memory does NOT replicate | peerHasPrivate=false | PASS |
| Metrics: counters and latency histograms are recorded | storeCounter=true, retrieveHistogram=true | PASS |
| Reproducible: same seed regenerates an identical synthetic corpus | 6cd045a8ca54 == 6cd045a8ca54 | PASS |
| Official benchmark: full BEIR/SciFact corpus + test qrels evaluated with a real learned model | BEIR / SciFact docs=5183 queries=300 model=transformer:Xenova/all-MiniLM-L6-v2 | PASS |
| Official benchmark: BEIR/SciFact nDCG@10 reflects real semantic retrieval (>= 0.40, far above chance) | nDCG@10=0.6927 | PASS |
| Official benchmark: BEIR/SciFact Recall@10 >= 0.40 | recall@10=0.8222 | PASS |
| Official benchmark: dataset checksums recorded for independent reproduction | corpus=dec31c8182f3 | PASS |
| Official benchmark: hybrid (dense+BM25 RRF) does not regress dense on nDCG@10 (same corpus/qrels) | dense=0.6443 -> hybrid=0.6902 (delta=0.0459) | PASS |
| Official benchmark: hybrid+cross-encoder rerank is competitive with hybrid on nDCG@10 | hybrid=0.6902 -> rerank=0.6927 (delta=0.0025) | PASS |
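Several checks above rely on Reciprocal Rank Fusion (RRF) to merge the dense and BM25 rankings. As a minimal sketch, assuming the standard formulation in which each item scores the sum of 1 / (k + rank) over the lists that contain it, with the commonly used constant k = 60, fusion can be written as below. The function name `reciprocalRankFusion`, the tie-break by id, and the sample rankings are hypothetical. The project's actual constant and tie-breaking are not stated in this report.

```javascript
// Illustrative Reciprocal Rank Fusion; not the project's implementation.
// `rankings` is an array of id lists, each ordered best match first.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    // Highest fused score first; ties broken by id for determinism.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .map(([id, score]) => ({ id, score }));
}

// Made-up rankings: "x" is second in both lists, "a" and "c" each top one.
const dense = ["a", "x", "b"];
const sparse = ["c", "x", "d"];
const fused = reciprocalRankFusion([dense, sparse]);
console.log(fused.map((r) => r.id)); // "x" comes first
```

Here "x" scores 2/62 while "a" and "c" each score 1/61, so the item both retrievers agree on outranks either list's individual top hit. This is the behavior the "multiply-agreed item first" check describes.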