@forge/agent-memory-manager
Hierarchical memory management for long-running AI agents and agent fleets.
It gives an agent a tiered memory (hot / warm / cold), vector retrieval backed by Postgres + pgvector, LLM summarization hooks, and explicit policies for time-based decay, relevance scoring (cosine similarity + heuristics), and eviction (LRU + importance). It is observable (structured logs + metrics) and extensible (swap the store, embeddings, summarizer, or sync transport behind small interfaces).
Scope & honesty: the default embedding model is a local, deterministic
*lexical* hashing model and the default summarizer is *extractive* (both run
offline). Hooks for hosted semantic embeddings and a Claude summarizer are
included behind the same interfaces but are network-dependent and are not
exercised by the test suite. See proof/LIMITATIONS.md.
Contents
- Concepts
- Install
- Quick start
- API
- Worked example: a fleet-monitoring agent
- Storage: in-memory vs Postgres/pgvector
- Policies
- Hybrid retrieval & reranking
- Fleet sync
- Observability
- Extensibility
- Verification & proof
Concepts
| Class | Responsibility |
|---|---|
| MemoryTier | Policy + bookkeeping for one tier (hot/warm/cold): decay half-life, capacity, retention scoring. |
| MemoryManager | The engine. Owns a store + embeddings + summarizer + policies; exposes store / retrieve / summarizeThread / getContextForTask; runs decay/eviction; emits logs + metrics; optional fleet sync. |
| AgentMemoryWrapper | Agent-friendly facade: scopes memory to an agentId, makes conversation threads ergonomic, and shares knowledge across a fleet. |
A memory flows through tiers over its lifetime:
```
store() ─► hot ──(decay / capacity)──► warm ──(decay / capacity)──► cold ──(capacity)──► evicted
           ▲
           └── retrieve() promotes a strongly-matching memory back to hot ◄────────
```
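The lifecycle in the diagram boils down to two transitions: a one-step demotion under decay or capacity pressure, and a promotion back to hot on a strong retrieval match. The sketch below is a minimal illustration, assuming the three tier names shown and a threshold compared against cosine similarity (as promoteThreshold is described under Policies). nextTier and afterRetrieve are hypothetical helpers, not exports of the package.

```typescript
// Hypothetical helpers that mirror the lifecycle diagram; not exports of the package.
type Tier = 'hot' | 'warm' | 'cold';
type Placement = Tier | 'evicted';

// Decay or capacity pressure moves a memory one step down; overflow from cold evicts it.
const DEMOTION: Record<Tier, Placement> = { hot: 'warm', warm: 'cold', cold: 'evicted' };

function nextTier(tier: Tier): Placement {
  return DEMOTION[tier];
}

// A retrieval whose cosine similarity is at or above the threshold pulls the
// memory back to hot; otherwise the memory stays in its current tier.
function afterRetrieve(tier: Tier, cosine: number, promoteThreshold: number): Tier {
  return cosine >= promoteThreshold ? 'hot' : tier;
}

console.log(nextTier('warm')); // cold
console.log(afterRetrieve('cold', 0.72, 0.6)); // hot
console.log(afterRetrieve('cold', 0.4, 0.6)); // cold
```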
Install
```bash
npm install
npm run build   # compiles TypeScript to dist/
npm test        # runs the node:test suite (incl. real pgvector via PGlite)
```
Runtime dependency: @electric-sql/pglite provides an in-process Postgres + pgvector engine (no server needed) and is used by the tests and the embedded store factory. pg is an *optional* dependency for talking to an external Postgres server.
Quick start
```ts
import { MemoryManager } from '@forge/agent-memory-manager';

const memory = new MemoryManager({ namespace: 'agent-1' });
await memory.init();

await memory.store('fact:disk-threshold', 'disk alerts fire at 90% usage', {
  importance: 0.8,
  type: 'fact',
  tags: ['ops', 'thresholds'],
});

const hits = await memory.retrieve('when do disk alerts trigger?', 5);
console.log(hits[0].record.value, hits[0].score);

const ctx = await memory.getContextForTask('summarize current disk risk');
console.log(ctx.context); // bounded, relevance-ranked context string
```
By default the manager uses the in-process InMemoryStore, local hashing embeddings, and the extractive summarizer — so the snippet above runs with no database and no network.
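getContextForTask returns a bounded, relevance-ranked context string. One common way to build such a string is greedy packing: sort the scored memories by score and append them until a character budget is spent. The sketch below assumes a hypothetical Hit shape and a plain character budget; the package's own assembly (separators, truncation, token versus character limits) may differ.

```typescript
// Hypothetical shape of a scored memory for this sketch only.
interface Hit {
  value: string;
  score: number;
}

// Greedily packs the highest-scoring memories into at most maxChars characters.
// An item that would overflow the budget is skipped, so a shorter, lower-ranked
// item can still use the remaining space.
function packContext(hits: Hit[], maxChars: number, separator = '\n'): string {
  const ranked = [...hits].sort((a, b) => b.score - a.score);
  const parts: string[] = [];
  let used = 0;
  for (const hit of ranked) {
    const cost = hit.value.length + (parts.length > 0 ? separator.length : 0);
    if (used + cost > maxChars) continue;
    parts.push(hit.value);
    used += cost;
  }
  return parts.join(separator);
}

// Illustrative inputs, not output from the package.
const packed = packContext(
  [
    { value: 'disk alerts fire at 90% usage', score: 0.91 },
    { value: 'web-3 returning HTTP 503 for /checkout', score: 0.42 },
    { value: 'db-1 nightly backup completed', score: 0.12 },
  ],
  70,
);
console.log(packed);
```

Skipping rather than truncating keeps every included memory intact, at the cost of sometimes leaving a little of the budget unused.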
API
MemoryManager exposes four core methods:
```ts
// Store (or update) a memory under a stable key.
store(key: string, value: string, metadata?: MemoryMetadata): Promise<MemoryRecord>

// Retrieve the most relevant memories for a free-text query.
retrieve(query: string, limit?: number, options?: RetrieveOptions): Promise<ScoredMemory[]>

// Summarize a thread (by id, or an explicit list of entries) and persist it.
summarizeThread(thread: string | ThreadEntry[], options?: SummarizeOptions): Promise<SummaryResult>

// Assemble a bounded, relevance-ranked context string for a task.
getContextForTask(task: string | TaskSpec): Promise<TaskContext>
```
MemoryMetadata is open-ended and includes importance (0–1), tags, type (observation | fact | summary | alert | …), threadId, agentId, source, and shareScope ('local' | 'fleet').
A ScoredMemory exposes the full score breakdown so ranking is inspectable:
```
{ record, score, components: { similarity, recency, importance, frequency, tagBoost, typeBoost } }
```
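Because the components are exposed, a ranking can be explained by printing them. The types below are a sketch inferred from the field names listed above; they are not the package's published declarations, and the explain helper is hypothetical.

```typescript
// Sketch of the documented shape; the real declarations may carry more fields.
type ScoreComponents = {
  similarity: number;
  recency: number;
  importance: number;
  frequency: number;
  tagBoost: number;
  typeBoost: number;
};

type ScoredMemorySketch = {
  record: { value: string };
  score: number;
  components: ScoreComponents;
};

// Hypothetical helper: lists the score components from largest to smallest.
function explain(hit: ScoredMemorySketch): string {
  const parts = Object.entries(hit.components)
    .sort((a, b) => b[1] - a[1])
    .map(([name, value]) => `${name}=${value.toFixed(2)}`);
  return `${hit.record.value} | score=${hit.score.toFixed(3)} | ${parts.join(', ')}`;
}

// Illustrative values only, not output from the package.
const exampleHit: ScoredMemorySketch = {
  record: { value: 'disk alerts fire at 90% usage' },
  score: 0.734,
  components: { similarity: 0.81, recency: 0.95, importance: 0.8, frequency: 0.1, tagBoost: 0.05, typeBoost: 0 },
};
console.log(explain(exampleHit));
```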
Worked example: a fleet-monitoring agent
A long-running monitoring agent ingests telemetry/alerts, periodically summarizes incidents to keep the hot tier small, shares learned remediations across the fleet, and builds a focused context window when asked to diagnose.
Run the full version with:
```bash
npm run build
node --import tsx examples/fleet-monitoring-agent.ts
```
```ts
import { MemoryManager, AgentMemoryWrapper, InProcessSyncBus, createLogger } from '@forge/agent-memory-manager';

const bus = new InProcessSyncBus(); // shared across the fleet
const logger = createLogger({ level: 'info', name: 'fleet' });

function makeAgent(nodeId: string, agentId: string) {
  const manager = new MemoryManager({
    namespace: 'fleet-monitoring',
    nodeId,
    sync: bus,
    logger,
    tiers: { hot: { capacity: 16 }, warm: { capacity: 256 } },
  });
  return new AgentMemoryWrapper({ agentId, manager });
}

const east = makeAgent('monitor-east', 'agent-A');
const west = makeAgent('monitor-west', 'agent-B');
await east.init();
await west.init();

// 1) Ingest a telemetry/alert stream into hierarchical memory.
const incident = 'incident-7731';
await east.remember('ALERT: db-1 connection pool saturated, 0 free connections',
  { importance: 0.9, type: 'alert', threadId: incident, tags: ['ops'] });
await east.remember('web-3 returning HTTP 503 for /checkout',
  { importance: 0.85, type: 'alert', threadId: incident, tags: ['ops'] });
await east.remember('db-1 slow query log shows full table scan on orders',
  { importance: 0.7, type: 'observation', threadId: incident, tags: ['ops'] });

// 2) Recall the most relevant memories for an investigation.
const recalled = await east.recall('why is checkout returning 503 errors', 3);

// 3) Compress the incident thread into a summary (keeps hot memory lean).
const summary = await east.summarizeThread(incident, { maxChars: 240 });

// 4) Share a learned remediation with the whole fleet.
await east.shareWithFleet(
  'Remediation: when db-1 pool saturates, raise max_connections and kill full-table-scan queries on orders.',
  { key: 'remediation:db-1-pool', importance: 0.95, tags: ['runbook'] },
);

// 5) agent-B (west) now sees the shared runbook via fleet sync.
const onWest = await west.manager.get('remediation:db-1-pool');

// 6) Build a bounded context window for a downstream task.
const ctx = await east.buildContext('diagnose the db-1 / checkout outage and recommend a fix', 600);
console.log(ctx.context);

// 7) Observe tier sizes + metrics.
console.log(await east.stats());
```
This pattern keeps an always-on agent bounded: high-signal alerts live in the hot tier, routine telemetry decays to warm/cold and is eventually evicted, and periodic summaries preserve the gist of old incidents without retaining every raw line.
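The default summarizer is described as extractive: it keeps existing lines instead of generating new text. A common extractive approach scores each line by how frequent its content words are across the whole thread and keeps the best lines, in their original order, within a character budget. The sketch below is that generic technique under stated assumptions (a simple regex tokenizer and a tiny hypothetical stopword list); it is not the package's ExtractiveSummarizer.

```typescript
// Hypothetical stopword list; a real summarizer would use a fuller one.
const STOPWORDS = new Set(['the', 'a', 'an', 'on', 'of', 'for', 'to', 'is', 'and', 'in', 'when']);

function contentTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9-]+/)
    .filter((t) => t.length > 0 && !STOPWORDS.has(t));
}

// Keeps the lines with the highest average content-word frequency that fit in
// maxChars (joined by single spaces), then restores their original order.
function extractiveSummary(lines: string[], maxChars: number): string {
  const freq = new Map<string, number>();
  for (const line of lines) {
    for (const t of contentTokens(line)) freq.set(t, (freq.get(t) ?? 0) + 1);
  }
  const scored = lines.map((line, index) => {
    const toks = contentTokens(line);
    const total = toks.reduce((sum, t) => sum + (freq.get(t) ?? 0), 0);
    return { line, index, score: toks.length > 0 ? total / toks.length : 0 };
  });
  scored.sort((a, b) => b.score - a.score);

  const chosen: typeof scored = [];
  let used = 0;
  for (const s of scored) {
    const cost = s.line.length + (chosen.length > 0 ? 1 : 0); // +1 for the joining space
    if (used + cost > maxChars) continue;
    chosen.push(s);
    used += cost;
  }
  return chosen
    .sort((a, b) => a.index - b.index)
    .map((s) => s.line)
    .join(' ');
}

const incidentSummary = extractiveSummary(
  [
    'ALERT: db-1 connection pool saturated, 0 free connections',
    'web-3 returning HTTP 503 for /checkout',
    'db-1 slow query log shows full table scan on orders',
  ],
  120,
);
console.log(incidentSummary);
```

Here db-1 appears in two lines, so those two lines outscore the /checkout line and fill the 120-character budget first.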
Storage: in-memory vs Postgres/pgvector
The store is a small interface (MemoryStore). Two implementations ship:
```ts
import pg from 'pg';
import { MemoryManager, InMemoryStore, createPgliteMemoryStore, PgVectorStore } from '@forge/agent-memory-manager';

// (a) Default: no dependencies, not durable.
const memStore = new InMemoryStore();

// (b) Embedded Postgres + pgvector (PGlite, WASM); durable if you pass a dataDir.
const { store: pgStore } = await createPgliteMemoryStore({ dimensions: 256, dataDir: './memdb' });

// (c) External Postgres server via node-postgres; identical SQL/pgvector code path.
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
const extStore = new PgVectorStore(pool, { dimensions: 256 });
await extStore.init();

const memory = new MemoryManager({ store: pgStore /* or memStore, extStore */, namespace: 'agent-1' });
```
PgVectorStore targets any executor exposing query(sql, params) => { rows }, which both pg.Pool and PGlite satisfy — so the embedded engine used in tests runs the same pgvector SQL (embedding <=> $1::vector) as a Postgres server.
The embedding dimension passed to the store must match your embedding
provider's dimensions (default 256).
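The executor contract and the dimension rule above are small enough to sketch together. The snippet assumes pgvector's text input format ('[1,2,3]') for the $1::vector parameter and its <=> cosine-distance operator; the table and column names (memories, key, value, embedding) and the helper names are hypothetical, not the package's schema or API.

```typescript
// Minimal contract that both pg.Pool and PGlite are described as satisfying.
interface QueryExecutor {
  query(sql: string, params?: unknown[]): Promise<{ rows: Record<string, unknown>[] }>;
}

// pgvector accepts vectors as text such as '[0.1,0.2,0.3]'. The length check
// enforces that the embedding matches the store's configured dimensions.
function toVectorLiteral(v: number[], dimensions: number): string {
  if (v.length !== dimensions) {
    throw new Error(`embedding has ${v.length} dimensions, store expects ${dimensions}`);
  }
  if (!v.every(Number.isFinite)) throw new Error('embedding contains a non-finite value');
  return `[${v.join(',')}]`;
}

// Hypothetical nearest-neighbour query over a hypothetical memories table.
// <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
async function nearest(db: QueryExecutor, queryVec: number[], dimensions: number, limit: number) {
  const { rows } = await db.query(
    `SELECT key, value, 1 - (embedding <=> $1::vector) AS similarity
       FROM memories
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [toVectorLiteral(queryVec, dimensions), limit],
  );
  return rows;
}
```

Because the function only depends on query(sql, params), the same code can be pointed at an embedded engine in tests and a server in production.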
Policies
All policies are pure functions in src/policies.ts and are unit-tested.
- Time-based decay: effectiveImportance = importance * 0.5^(age / tierHalfLife).
  Each tier has its own half-life (hot: 1h, warm: 24h, cold: 30d by default). When a memory's effective importance falls below a tier's floor (its minImportance setting), it is demoted.
- Relevance scoring: a weighted blend of cosine similarity, recency, importance, and access frequency, plus small tag and type boosts. Weights are configurable via relevanceWeights.
- Eviction: a retention score blends LRU (recency of last access), decayed importance, and frequency. On capacity overflow, the lowest-retention records are demoted to the next tier; cold-tier overflow is deleted (evicted).
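The decay rule translates directly into code. The sketch below implements the stated formula and the demotion check as written; the retention blend's weights (0.4 / 0.4 / 0.2), its one-hour LRU scale, and its log-scaled frequency term are hypothetical placeholders, because the package's defaults are not given here.

```typescript
const HOUR = 60 * 60 * 1000;

// effectiveImportance = importance * 0.5^(age / tierHalfLife)
function effectiveImportance(importance: number, ageMs: number, halfLifeMs: number): number {
  return importance * Math.pow(0.5, ageMs / halfLifeMs);
}

// Demote once the decayed importance drops below the tier's floor.
function shouldDemote(importance: number, ageMs: number, halfLifeMs: number, floor: number): boolean {
  return effectiveImportance(importance, ageMs, halfLifeMs) < floor;
}

interface RetentionInput {
  importance: number;
  ageMs: number; // time since creation; drives decay
  sinceLastAccessMs: number; // drives the LRU term
  accessCount: number;
}

// Hypothetical retention blend (weights and scales are placeholders).
function retentionScore(m: RetentionInput, halfLifeMs: number): number {
  const lru = Math.exp(-m.sinceLastAccessMs / HOUR); // 1 right after access, falls toward 0
  const decayed = effectiveImportance(m.importance, m.ageMs, halfLifeMs);
  const freq = 1 - 1 / (1 + Math.log1p(m.accessCount)); // 0 when never accessed
  return 0.4 * lru + 0.4 * decayed + 0.2 * freq;
}

// On overflow, return the lowest-retention records beyond capacity; these move
// to the next tier (or are evicted when the tier is cold).
function overflow<T extends RetentionInput>(records: T[], capacity: number, halfLifeMs: number): T[] {
  if (records.length <= capacity) return [];
  return [...records]
    .sort((a, b) => retentionScore(a, halfLifeMs) - retentionScore(b, halfLifeMs))
    .slice(0, records.length - capacity);
}

// With the hot tier's 1h half-life, importance 0.8 halves after one hour.
console.log(effectiveImportance(0.8, HOUR, HOUR)); // 0.4
```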
Everything is configurable per manager:
```ts
import { MemoryManager } from '@forge/agent-memory-manager';

const memory = new MemoryManager({
  tiers: { hot: { capacity: 64, halfLifeMs: 30 * 60_000, minImportance: 0.7 } },
  relevanceWeights: { similarity: 0.6, recency: 0.2, importance: 0.15 },
  promoteThreshold: 0.6, // cosine at/above which a retrieved memory is pulled to hot
});
```
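A weighted relevance blend with the relevanceWeights from the example above could look like the following sketch. The frequency weight (0.05), the boost sizes, the one-hour recency scale, and the way each component is normalized to [0, 1] are assumptions for illustration, not the package's defaults; the Query and Candidate shapes are hypothetical.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

interface Weights { similarity: number; recency: number; importance: number; frequency: number }
interface Query { embedding: number[]; tags: string[]; type?: string }
interface Candidate {
  embedding: number[];
  ageMs: number;
  importance: number; // 0..1
  accessCount: number;
  tags: string[];
  type: string;
}

function relevance(q: Query, c: Candidate, w: Weights): number {
  const similarity = cosine(q.embedding, c.embedding);
  const recency = Math.exp(-c.ageMs / 3_600_000); // assumed 1h scale
  const frequency = Math.min(1, Math.log1p(c.accessCount) / Math.log1p(100)); // assumed saturation at 100
  const tagBoost = c.tags.some((t) => q.tags.includes(t)) ? 0.05 : 0; // assumed boost size
  const typeBoost = q.type !== undefined && c.type === q.type ? 0.03 : 0; // assumed boost size
  return (
    w.similarity * similarity +
    w.recency * recency +
    w.importance * c.importance +
    w.frequency * frequency +
    tagBoost +
    typeBoost
  );
}

// similarity/recency/importance from the config example; frequency is assumed.
const weights: Weights = { similarity: 0.6, recency: 0.2, importance: 0.15, frequency: 0.05 };
```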
Hybrid retrieval & reranking
Retrieval is a two-stage pipeline (default retrievalMode: 'hybrid'):
- First stage (recall). A dense semantic search (vector store) and a BM25 lexical search (Bm25Index) run in parallel, and their ranked lists are combined with Reciprocal Rank Fusion (reciprocalRankFusion). Dense retrieval captures meaning; BM25 captures exact rare tokens such as hostnames, error codes, and ticket IDs that embeddings tend to blur. Fusion uses only rank positions, so the two retrievers' scores never need to be normalized onto a common scale.
- Second stage (rerank, optional). A cross-encoder reads each (query, document) pair jointly and reorders the head of the fused list. It is the single largest precision lever, but because it runs a model per pair it is applied only to the top rerankTopK candidates.
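Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in, so it needs only positions, not comparable scores. The sketch below shows RRF with the commonly used k = 60, followed by a second stage that rescores only the first topK fused results with a caller-supplied pair scorer (a stand-in for a cross-encoder). The function names and the scorer signature are hypothetical; the package's reciprocalRankFusion and Reranker interfaces may differ.

```typescript
// Fuse ranked lists of ids (best first) with Reciprocal Rank Fusion.
function rrf(lists: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      // Ranks are 1-based in the RRF formula.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Second stage: rescore only the head of the fused list with an expensive
// pairwise scorer and keep the tail in fused order.
async function rerankHead(
  query: string,
  fused: { id: string; text: string }[],
  scorePair: (query: string, text: string) => Promise<number>,
  topK: number,
): Promise<{ id: string; text: string }[]> {
  const head = fused.slice(0, topK);
  const scored = await Promise.all(head.map(async (d) => ({ d, s: await scorePair(query, d.text) })));
  scored.sort((a, b) => b.s - a.s);
  return [...scored.map((x) => x.d), ...fused.slice(topK)];
}

const dense = ['m1', 'm2', 'm3'];
const lexical = ['m3', 'm1', 'm4'];
console.log(rrf([dense, lexical]).map((r) => r.id)); // [ 'm1', 'm3', 'm2', 'm4' ]
```

Since only the head is rescored, reranking cost grows with rerankTopK rather than with the number of stored memories.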
```ts
import { MemoryManager, LocalCrossEncoderReranker } from '@forge/agent-memory-manager';

const memory = new MemoryManager({
  retrievalMode: 'hybrid',                    // 'hybrid' (default) | 'dense'
  reranker: new LocalCrossEncoderReranker(),  // optional MiniLM cross-encoder (transformers.js)
  rerankTopK: 50,                             // candidates handed to the reranker
});

// Per-call overrides:
await memory.retrieve('TICKET-9981', 5);                                // hybrid + rerank
await memory.retrieve('disk pressure', 5, { retrievalMode: 'dense' });  // dense only
await memory.retrieve('disk pressure', 5, { rerank: false });           // skip reranker
```
The BM25 index is maintained incrementally inside the manager (store-agnostic), so hybrid retrieval works with any MemoryStore backend. The dense-vs-hybrid and hybrid-vs-rerank lift is measured on the official BEIR / SciFact benchmark — see bench/beir-scifact.mjs and the officialBenchmark block of the verification report.
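An incremental BM25 index needs only per-document term counts, document lengths, and document frequencies, all of which can be updated as memories are stored or removed. The sketch below uses Okapi BM25 with k1 = 1.2, b = 0.75 and the non-negative idf variant ln(1 + (N - df + 0.5) / (df + 0.5)); the class name, tokenizer, and parameter values are assumptions, not the package's Bm25Index.

```typescript
class Bm25Sketch {
  private docs = new Map<string, Map<string, number>>(); // id -> term counts
  private lengths = new Map<string, number>(); // id -> token count
  private df = new Map<string, number>(); // term -> number of docs containing it
  private totalLength = 0;
  private readonly k1: number;
  private readonly b: number;

  constructor(k1 = 1.2, b = 0.75) {
    this.k1 = k1;
    this.b = b;
  }

  // Keeps hyphens so tokens like db-1 or ticket-9981 stay whole.
  static tokenize(text: string): string[] {
    return text.toLowerCase().split(/[^a-z0-9_-]+/).filter(Boolean);
  }

  add(id: string, text: string): void {
    this.remove(id); // re-storing a key replaces its old entry
    const toks = Bm25Sketch.tokenize(text);
    const counts = new Map<string, number>();
    for (const t of toks) counts.set(t, (counts.get(t) ?? 0) + 1);
    for (const t of counts.keys()) this.df.set(t, (this.df.get(t) ?? 0) + 1);
    this.docs.set(id, counts);
    this.lengths.set(id, toks.length);
    this.totalLength += toks.length;
  }

  remove(id: string): void {
    const counts = this.docs.get(id);
    if (!counts) return;
    for (const t of counts.keys()) {
      const n = (this.df.get(t) ?? 1) - 1;
      if (n === 0) this.df.delete(t);
      else this.df.set(t, n);
    }
    this.totalLength -= this.lengths.get(id) ?? 0;
    this.docs.delete(id);
    this.lengths.delete(id);
  }

  search(query: string, limit: number): { id: string; score: number }[] {
    const n = this.docs.size;
    if (n === 0) return [];
    const avgdl = this.totalLength / n;
    const terms = Bm25Sketch.tokenize(query);
    const results: { id: string; score: number }[] = [];
    for (const [id, counts] of this.docs) {
      const dl = this.lengths.get(id) ?? 0;
      let score = 0;
      for (const t of terms) {
        const tf = counts.get(t) ?? 0;
        if (tf === 0) continue;
        const df = this.df.get(t) ?? 0;
        const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        score += (idf * tf * (this.k1 + 1)) / (tf + this.k1 * (1 - this.b + (this.b * dl) / avgdl));
      }
      if (score > 0) results.push({ id, score });
    }
    return results.sort((a, b) => b.score - a.score).slice(0, limit);
  }
}

const bm25 = new Bm25Sketch();
bm25.add('m1', 'web-3 returning HTTP 503 for /checkout');
bm25.add('m2', 'TICKET-9981: db-1 pool saturated');
bm25.add('m3', 'disk pressure on db-1');
console.log(bm25.search('TICKET-9981', 5));
```

The exact-token match on TICKET-9981 is the kind of hit that a dense embedding can miss, which is why the lexical leg is fused in rather than replaced.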
The cross-encoder needs the optional @xenova/transformers dependency and a
one-time model download (DISCLOSED_SEAM). Without a reranker configured, the
manager runs hybrid first-stage retrieval only: fully local, no download.
Fleet sync
For distributed/fleet use, pass a SyncBus. Memories written with shareScope: 'fleet' are replicated to peer managers on the same bus and namespace; 'local' memories stay put. The built-in InProcessSyncBus (EventEmitter) is what the tests use; implement the same SyncBus interface over Redis/NATS/Kafka for a real fleet (a disclosed seam — not exercised here).
```ts
import { MemoryManager, InProcessSyncBus } from '@forge/agent-memory-manager';

const bus = new InProcessSyncBus();
const a = new MemoryManager({ namespace: 'fleet', nodeId: 'a', sync: bus });
const b = new MemoryManager({ namespace: 'fleet', nodeId: 'b', sync: bus });
await a.init();
await b.init();

await a.store('runbook:1', 'shared knowledge', { shareScope: 'fleet' });
// b.get('runbook:1') resolves to the replicated record once the event has propagated
```
Last-writer-wins by updatedAt; a node ignores its own events (no echo loop).
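The replication rule (last-writer-wins on updatedAt, ignore events from the local node) can be sketched over Node's EventEmitter. The event name, record shape, and LwwReplica class are hypothetical and illustrate the rule, not the package's SyncBus interface. The nodeId tie-break on equal timestamps is an added assumption so that every replica picks the same winner.

```typescript
import { EventEmitter } from 'node:events';

interface SyncedRecord {
  key: string;
  value: string;
  updatedAt: number; // epoch ms
  originNode: string;
}

class LwwReplica {
  readonly records = new Map<string, SyncedRecord>();
  readonly nodeId: string;
  private readonly bus: EventEmitter;

  constructor(nodeId: string, bus: EventEmitter) {
    this.nodeId = nodeId;
    this.bus = bus;
    bus.on('memory', (rec: SyncedRecord) => {
      if (rec.originNode === this.nodeId) return; // ignore our own events: no echo loop
      this.apply(rec);
    });
  }

  write(key: string, value: string, updatedAt = Date.now()): void {
    const rec: SyncedRecord = { key, value, updatedAt, originNode: this.nodeId };
    this.apply(rec);
    this.bus.emit('memory', rec);
  }

  // Keep whichever version is newer; equal timestamps fall back to nodeId order.
  private apply(rec: SyncedRecord): void {
    const current = this.records.get(rec.key);
    const newer =
      !current ||
      rec.updatedAt > current.updatedAt ||
      (rec.updatedAt === current.updatedAt && rec.originNode > current.originNode);
    if (newer) this.records.set(rec.key, rec);
  }
}

const syncBus = new EventEmitter();
const nodeA = new LwwReplica('a', syncBus);
const nodeB = new LwwReplica('b', syncBus);
nodeA.write('runbook:1', 'v1', 1000);
nodeB.write('runbook:1', 'v2', 2000);
nodeA.write('runbook:1', 'stale', 1500); // older than v2, so both replicas keep v2
console.log(nodeA.records.get('runbook:1')?.value, nodeB.records.get('runbook:1')?.value); // v2 v2
```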
Observability
- Logs: createLogger({ level, name }) emits structured JSON lines to stderr. Inject a sink to route them elsewhere; use silentLogger in tests.
- Metrics: manager.metrics.snapshot() returns counters (memory_store_total, memory_retrieve_total, memory_evict_total, memory_demote_total, memory_promote_total, memory_replicated_total, …), gauges (memory_tier_size{tier}), and latency histograms (memory_store_ms, memory_retrieve_ms, memory_summarize_ms). manager.stats() adds live per-tier sizes. Swap in Prometheus/OTEL via the Metrics interface.
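A minimal in-memory recorder covering the three metric kinds listed above might look like the following. The MetricsSketch class, its method names, and the label encoding are hypothetical; they only illustrate counters, labeled gauges, and latency samples and are not the package's Metrics interface. The p50 reported here is the lower median of the recorded samples.

```typescript
type Labels = Record<string, string>;

// Encodes a metric name plus labels, e.g. memory_tier_size{tier="hot"}.
function seriesKey(name: string, labels: Labels = {}): string {
  const parts = Object.keys(labels).sort().map((k) => `${k}="${labels[k]}"`);
  return parts.length > 0 ? `${name}{${parts.join(',')}}` : name;
}

class MetricsSketch {
  private counters = new Map<string, number>();
  private gauges = new Map<string, number>();
  private samples = new Map<string, number[]>();

  inc(name: string, labels?: Labels, by = 1): void {
    const key = seriesKey(name, labels);
    this.counters.set(key, (this.counters.get(key) ?? 0) + by);
  }

  set(name: string, value: number, labels?: Labels): void {
    this.gauges.set(seriesKey(name, labels), value);
  }

  observe(name: string, ms: number): void {
    const list = this.samples.get(name) ?? [];
    list.push(ms);
    this.samples.set(name, list);
  }

  // Times an async operation and records its duration in milliseconds.
  async time<T>(name: string, fn: () => Promise<T>): Promise<T> {
    const start = performance.now();
    try {
      return await fn();
    } finally {
      this.observe(name, performance.now() - start);
    }
  }

  snapshot() {
    const histograms: Record<string, { count: number; p50: number; max: number }> = {};
    for (const [name, list] of this.samples) {
      const sorted = [...list].sort((a, b) => a - b);
      histograms[name] = {
        count: sorted.length,
        p50: sorted[Math.floor((sorted.length - 1) / 2)],
        max: sorted[sorted.length - 1],
      };
    }
    return {
      counters: Object.fromEntries(this.counters),
      gauges: Object.fromEntries(this.gauges),
      histograms,
    };
  }
}
```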
Extensibility
Every collaborator is an interface with a sensible default:
| Interface | Default | Swap for |
|---|---|---|
| MemoryStore | InMemoryStore | PgVectorStore (PGlite or external Postgres), your own backend |
| EmbeddingProvider | HashingEmbeddingProvider (local, lexical) | LocalTransformerEmbeddingProvider (MiniLM), RemoteEmbeddingProvider (OpenAI/Voyage/Cohere/…) |
| Reranker | none (hybrid first-stage only) | LocalCrossEncoderReranker (MiniLM cross-encoder), or any hosted reranker |
| Summarizer | ExtractiveSummarizer (local) | ClaudeSummarizer (Anthropic) or any LLM |
| SyncBus | InProcessSyncBus | Redis/NATS/Kafka adapter |
| Logger / Metrics | JSON logger / in-memory metrics | pino/Prometheus/OTEL adapters |
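The default HashingEmbeddingProvider is described as a local, deterministic, lexical hashing model. The feature-hashing technique behind that kind of model can be sketched as follows: hash each token into one of `dimensions` buckets with a sign bit, sum, and L2-normalize. The provider shape ({ dimensions, embed }) and the use of 32-bit FNV-1a over UTF-16 code units are assumptions for illustration; the package's EmbeddingProvider interface and hashing details may differ.

```typescript
// Assumed provider shape for this sketch.
interface EmbeddingProviderSketch {
  readonly dimensions: number;
  embed(texts: string[]): Promise<number[][]>;
}

// 32-bit FNV-1a hash over UTF-16 code units.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class HashingEmbeddingSketch implements EmbeddingProviderSketch {
  readonly dimensions: number;

  constructor(dimensions = 256) {
    this.dimensions = dimensions;
  }

  embedOne(text: string): number[] {
    const v = new Array<number>(this.dimensions).fill(0);
    for (const token of text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)) {
      const h = fnv1a(token);
      const sign = h & 1 ? -1 : 1; // signed hashing reduces bias from bucket collisions
      v[(h >>> 1) % this.dimensions] += sign;
    }
    const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
    return norm === 0 ? v : v.map((x) => x / norm);
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((t) => this.embedOne(t));
  }
}
```

Because only surface tokens are hashed, "disk" and "storage" land in unrelated buckets; that is the gap the transformer and hosted providers in the table are meant to close.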
Example: hosted embeddings + Claude summarization.
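```ts
import { MemoryManager, RemoteEmbeddingProvider, ClaudeSummarizer } from '@forge/agent-memory-manager';

// Placeholder: replace with a call to your embedding API that returns one number[] per input text.
declare function callYourEmbeddingApi(texts: string[]): Promise<number[][]>;

const embeddings = new RemoteEmbeddingProvider({
  dimensions: 1536,
  embedFn: async (texts) => callYourEmbeddingApi(texts), // returns number[][]
});

const summarizer = new ClaudeSummarizer({ model: 'claude-3-5-sonnet-latest' }); // uses ANTHROPIC_API_KEY

// Pair these with a store created with dimensions: 1536.
const memory = new MemoryManager({ embeddings, summarizer /* + a matching-dimension store */ });
```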
Verification & proof
```bash
node verify.mjs   # builds, runs unit tests, runs the synthetic retrieval
                  # benchmark over BOTH the in-memory store and real pgvector,
                  # checks decay/eviction/sync/metrics, writes verification-report.*
```
This outcome carries a full Forge proof package under proof/ (evidence, claim lineage, checksums, Evidence Grade, Trust Score, auditor challenge, and the authoritative gate decision). Start with proof/EXECUTIVE_EVIDENCE.md and proof/REPRODUCE.md.