Agent Memory Manager

Verification Report — Hierarchical Agent Memory Manager

Strictness: IRS_AUDITOR | Proof status: public API + policies verified over the in-memory store AND a real Postgres+pgvector engine (PGlite); inbound benchmark corpus is synthetic/seeded

Checks: PASS 30 / 30 (100%) | Generated: 2026-06-26T14:11:10.274Z

Infrastructure: database engine PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.74 (1092ec30a3fb1d46b1782ff1b4db5094d3d06ae5), 32-bit; pgvector 0.8.0; embeddings: hashing-local; summarizer: extractive-local.

Disclosed seams & limitations

SYNTHETIC INPUT: the retrieval benchmark corpus is generated by a seeded PRNG with a known topic answer key (verify.mjs). Reported precision is measured against that synthetic key, not against real production text, so absolute accuracy on real agent traffic will differ. This is the blocking gap for an official benchmark / PRODUCTION_VALIDATED.
DISCLOSED_SEAM: the default embeddings are a local, deterministic hashing model (lexical, not learned-semantic). Cosine similarity tracks token overlap. A hosted semantic embedding model can be plugged in via RemoteEmbeddingProvider but is NOT exercised here.
DISCLOSED_SEAM: the default summarizer is a local extractive (frequency-based) summarizer. ClaudeSummarizer (hosted LLM) is provided behind the same interface but requires a network + ANTHROPIC_API_KEY and is NOT exercised here.
LIVE INFRASTRUCTURE (in-process): Postgres + pgvector run in-process via PGlite (WASM). The identical SQL/pgvector code path runs against an external Postgres server through node-postgres (pg) — a wire-compatible disclosed seam not exercised in this run.
DISCLOSED_SEAM: fleet sync is exercised over an in-process EventEmitter bus. A distributed broker (Redis/NATS/Kafka) implementing the same SyncBus interface is a disclosed seam, not exercised here.
DISCLOSED_SEAM: the cross-encoder reranker (LocalCrossEncoderReranker) requires the optional @xenova/transformers dependency + a one-time model download. verify.mjs exercises the reranker INTERFACE with a deterministic fake; the real MiniLM cross-encoder is measured separately in the official BEIR/SciFact benchmark (bench/beir-scifact.mjs), whose results are recorded in officialBenchmark.
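
To illustrate why the hashing-embeddings seam matters, here is a minimal sketch of a deterministic hashing embedding with cosine similarity. It is not the project's actual embedding code: the names embed, cosine, tokenize and fnv1a, the 256-dimension size, and the FNV-1a-style hash are all assumptions chosen for illustration. It shows that similarity under such a model follows shared tokens, so a paraphrase with no tokens in common scores low.

```javascript
// Hypothetical sketch of a local hashing embedding (not the project's implementation).
const DIM = 256; // assumed vector size

// FNV-1a-style 32-bit hash over UTF-16 code units.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Lowercase and split into alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Count each token into a hashed bucket, then L2-normalize.
function embed(text, dim = DIM) {
  const v = new Float64Array(dim);
  for (const tok of tokenize(text)) v[fnv1a(tok) % dim] += 1;
  let norm = 0;
  for (const x of v) norm += x * x;
  norm = Math.sqrt(norm);
  if (norm > 0) for (let i = 0; i < dim; i++) v[i] /= norm;
  return v;
}

// Cosine similarity; returns 0 when either vector is all zeros.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

const original = embed("rotate the database password");
const reordered = embed("password rotate database the"); // same bag of tokens
const paraphrase = embed("change db credentials");       // no shared tokens

const simReordered = cosine(original, reordered);   // approximately 1
const simParaphrase = cosine(original, paraphrase); // low unless buckets collide
```

Word order is ignored and meaning is invisible to this kind of model, which is why the report treats a learned semantic provider as a separate, unexercised seam.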

Synthetic retrieval benchmark

Metric	Value
Memories / topics	144 / 12 (seed 20260625)
In-memory precision@1 / @5	1 / 1
pgvector precision@1 / @5	1 / 1
Store top-1 agreement (in-mem vs pgvector)	1
pgvector vs JS cosine abs error	2e-8
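
As a rough illustration of how a seeded corpus and a precision@k figure like the ones above can be produced, the sketch below uses a mulberry32-style PRNG and a topic answer key. It is not taken from verify.mjs: makeCorpus, precisionAtK, the text template and the filler tokens are hypothetical, and precision@k is assumed here to mean the fraction of the top k results whose topic matches the query's topic. Only the seed, the corpus size and the mem-<topic>-<index> id pattern echo values in the report.

```javascript
// Hypothetical sketch; verify.mjs may generate and score its corpus differently.

// mulberry32-style seeded PRNG: same seed, same sequence of floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build topics * perTopic memories; the topic field is the answer key.
function makeCorpus(seed, topics = 12, perTopic = 12) {
  const rand = mulberry32(seed);
  const corpus = [];
  for (let t = 0; t < topics; t++) {
    for (let i = 0; i < perTopic; i++) {
      const filler = `filler${Math.floor(rand() * 1000)}`; // PRNG-chosen noise token
      corpus.push({ id: `mem-${t}-${i}`, topic: t, text: `topic${t} subject${t} ${filler}` });
    }
  }
  return corpus;
}

// Fraction of the top k ranked memories whose topic equals the expected topic.
function precisionAtK(ranked, expectedTopic, k) {
  const top = ranked.slice(0, k);
  if (top.length === 0) return 0;
  const hits = top.filter((m) => m.topic === expectedTopic).length;
  return hits / k;
}

const corpusA = makeCorpus(20260625);
const corpusB = makeCorpus(20260625); // regenerated from the same seed

// A made-up ranking for a topic-0 query: four correct results and one wrong one.
const ranking = [corpusA[0], corpusA[1], corpusA[12], corpusA[2], corpusA[3]];
const p1 = precisionAtK(ranking, 0, 1); // 1
const p5 = precisionAtK(ranking, 0, 5); // 0.8
```

Averaging precisionAtK over one query per topic would give a single corpus-level figure comparable in shape to the table above, though not in value, since the corpus here is only illustrative.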

Checks

Check	Detail	Result
Unit suite: node:test passes (policies, manager API, tiering, sync, pgvector)	tests passed=32 fail=0	PASS
In-memory retrieval precision@1 == 1.0 on the synthetic answer key	p@1=1	PASS
In-memory retrieval precision@5 >= 0.95	p@5=1	PASS
Deterministic retrieval: identical query returns identical ordering	mem-0-4,mem-0-8,mem-0-7,mem-0-9,mem-0-10 == mem-0-4,mem-0-8,mem-0-7,mem-0-9,mem-0-10	PASS
BM25 sparse index ranks an exact rare-token match first	top=rare	PASS
Reciprocal Rank Fusion ranks the multiply-agreed item first	top=x	PASS
Hybrid retrieval (dense+BM25 RRF) surfaces the exact-token memory first	top=rare	PASS
Reranker reorders candidates to surface the cross-encoder-preferred memory	top=target	PASS
pgvector retrieval precision@1 == 1.0 on the synthetic answer key	p@1=1	PASS
pgvector retrieval precision@5 >= 0.95	p@5=1	PASS
pgvector cosine distance agrees with JS cosine similarity (<=1e-5)	pg=0.48795 js=0.48795	PASS
pgvector persistence: every memory persisted to Postgres	rows=144 == 144	PASS
Store parity: in-memory and pgvector agree on the top-1 topic for every query	12/12	PASS
summarizeThread produces a non-empty summary and persists a summary memory	len=200, persisted=true, type=summary	PASS
summarizeThread: summary is retrievable from the store	found=true	PASS
getContextForTask respects the character budget	chars=226 <= 300(+1 line)	PASS
getContextForTask prioritizes the summary memory	firstType=summary	PASS
Time decay demotes a hot memory below the hot importance floor	hot -> warm	PASS
Hot tier capacity enforced (overflow demoted)	hot=4 <= 4	PASS
Cold tier eviction caps size and emits eviction metric	cold=3 <= 3, evicted=5	PASS
Fleet sync: fleet-scoped memory replicates to a peer node	replicated=true	PASS
Fleet sync: local-scoped memory does NOT replicate	peerHasPrivate=false	PASS
Metrics: counters and latency histograms are recorded	storeCounter=true, retrieveHistogram=true	PASS
Reproducible: same seed regenerates an identical synthetic corpus	6cd045a8ca54 == 6cd045a8ca54	PASS
Official benchmark: full BEIR/SciFact corpus + test qrels evaluated with a real learned model	BEIR / SciFact docs=5183 queries=300 model=transformer:Xenova/all-MiniLM-L6-v2	PASS
Official benchmark: BEIR/SciFact nDCG@10 reflects real semantic retrieval (>= 0.40, far above chance)	nDCG@10=0.6927	PASS
Official benchmark: BEIR/SciFact Recall@10 >= 0.40	recall@10=0.8222	PASS
Official benchmark: dataset checksums recorded for independent reproduction	corpus=dec31c8182f3	PASS
Official benchmark: hybrid (dense+BM25 RRF) does not regress dense on nDCG@10 (same corpus/qrels)	dense=0.6443 -> hybrid=0.6902 (delta=0.0459)	PASS
Official benchmark: hybrid+cross-encoder rerank is competitive with hybrid on nDCG@10	hybrid=0.6902 -> rerank=0.6927 (delta=0.0025)	PASS
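
Several checks above rely on Reciprocal Rank Fusion (RRF) to merge the dense and BM25 rankings. The sketch below shows the standard formulation, where each item scores the sum of 1 / (k + rank) over the lists it appears in. The function name reciprocalRankFusion, the constant k = 60 (a common default) and the example ids are assumptions; the project's own fusion code and constant may differ.

```javascript
// Hypothetical sketch of Reciprocal Rank Fusion; not the project's implementation.
// rankings: array of ranked id lists, best first. k dampens the weight of top ranks.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    // Sort by score descending; break ties by id so the output is deterministic.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .map(([id, score]) => ({ id, score }));
}

// Made-up rankings: "x" is ranked highly by both retrievers, "a" by only one.
const denseRanking = ["a", "x", "b"];
const sparseRanking = ["x", "c", "a"];

const fused = reciprocalRankFusion([denseRanking, sparseRanking]);
const fusedIds = fused.map((r) => r.id); // ["x", "a", "c", "b"]
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, and an item that both lists agree on tends to rise to the top, as "x" does here.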