Agent Memory Manager

@forge/agent-memory-manager

Hierarchical memory management for long-running AI agents and agent fleets.

It gives an agent a tiered memory (hot / warm / cold), vector retrieval backed by Postgres + pgvector, LLM summarization hooks, and explicit policies for time-based decay, relevance scoring (cosine similarity + heuristics), and eviction (LRU + importance). It is observable (structured logs + metrics) and extensible (swap the store, embeddings, summarizer, or sync transport behind small interfaces).

Scope & honesty: the default embedding model is a local, deterministic
*lexical* hashing model and the default summarizer is *extractive* (both run
offline). Hooks for hosted semantic embeddings and a Claude summarizer are
included behind the same interfaces but are network-dependent and are not
exercised by the test suite. See proof/LIMITATIONS.md.

Concepts
Install
Quick start
API
Worked example: a fleet-monitoring agent
Storage: in-memory vs Postgres/pgvector
Policies
Hybrid retrieval & reranking
Fleet sync
Observability
Extensibility
Verification & proof

Concepts

Class	Responsibility
`MemoryTier`	Policy + bookkeeping for one tier (`hot`/`warm`/`cold`): decay half-life, capacity, retention scoring.
`MemoryManager`	The engine. Owns a store + embeddings + summarizer + policies; exposes `store` / `retrieve` / `summarizeThread` / `getContextForTask`; runs decay/eviction; emits logs + metrics; optional fleet sync.
`AgentMemoryWrapper`	Agent-friendly facade: scopes memory to an `agentId`, makes conversation threads ergonomic, and shares knowledge across a fleet.

A memory flows through tiers over its lifetime:

store() ─► hot ──(decay / capacity)──► warm ──(decay / capacity)──► cold ──(capacity)──► evicted
   ▲                                                                      
   └── retrieve() promotes a strongly-matching memory back to hot ◄────────
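
The lifecycle in the diagram can be read as a small state machine. The sketch below is illustrative only: the names `demote` and `placeAfterRetrieve` are hypothetical, and the 0.6 promotion threshold is an assumed value, not necessarily the package default.

```typescript
type Tier = "hot" | "warm" | "cold";
type Placement = Tier | "evicted";

// Demotion order from the diagram: hot -> warm -> cold -> evicted.
// In the diagram, cold records leave only on capacity overflow.
const NEXT_DOWN: Record<Tier, Placement> = {
  hot: "warm",
  warm: "cold",
  cold: "evicted",
};

// One demotion step, triggered by decay or capacity pressure.
function demote(tier: Tier): Placement {
  return NEXT_DOWN[tier];
}

// A strong match on retrieve pulls the memory back to hot; otherwise it stays put.
// The 0.6 default is an assumption made for this sketch.
function placeAfterRetrieve(tier: Tier, similarity: number, promoteThreshold = 0.6): Tier {
  return similarity >= promoteThreshold ? "hot" : tier;
}
```

Promotion skips intermediate tiers in this sketch: a cold memory that matches strongly goes straight to hot, as the return arrow in the diagram suggests.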

Install

npm install
npm run build      # compiles TypeScript to dist/
npm test           # runs the node:test suite (incl. real pgvector via PGlite)

Runtime dependency: @electric-sql/pglite provides an in-process Postgres + pgvector engine (no server needed) and is used by the tests and the embedded store factory. pg is an *optional* dependency for talking to an external Postgres server.

Quick start

import { MemoryManager } from '@forge/agent-memory-manager';

const memory = new MemoryManager({ namespace: 'agent-1' });
await memory.init();

await memory.store('fact:disk-threshold', 'disk alerts fire at 90% usage', {
  importance: 0.8,
  type: 'fact',
  tags: ['ops', 'thresholds'],
});

const hits = await memory.retrieve('when do disk alerts trigger?', 5);
console.log(hits[0].record.value, hits[0].score);

const ctx = await memory.getContextForTask('summarize current disk risk');
console.log(ctx.context); // bounded, relevance-ranked context string

By default the manager uses the in-process InMemoryStore, local hashing embeddings, and the extractive summarizer — so the snippet above runs with no database and no network.

API

The four core methods on MemoryManager:

// Store (or update) a memory under a stable key.
store(key: string, value: string, metadata?: MemoryMetadata): Promise<MemoryRecord>

// Retrieve the most relevant memories for a free-text query.
retrieve(query: string, limit?: number, options?: RetrieveOptions): Promise<ScoredMemory[]>

// Summarize a thread (by id, or an explicit list of entries) and persist it.
summarizeThread(thread: string | ThreadEntry[], options?: SummarizeOptions): Promise<SummaryResult>

// Assemble a bounded, relevance-ranked context string for a task.
getContextForTask(task: string | TaskSpec): Promise<TaskContext>

A ScoredMemory exposes the full score breakdown so ranking is inspectable:

{ record, score, components: { similarity, recency, importance, frequency, tagBoost, typeBoost } }
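
To make the breakdown concrete, here is one way the components could combine into a single score. This is a sketch under stated assumptions, not the package's actual formula: the `ScoreComponents` and `BlendWeights` shapes, the `blendScore` name, the 0.05 frequency weight, and the choice to add the boosts unweighted are all hypothetical.

```typescript
// Hypothetical shapes for illustration; the package's real types may differ.
interface ScoreComponents {
  similarity: number; // cosine similarity between query and memory
  recency: number;    // assumed normalized to 0..1
  importance: number; // assumed normalized to 0..1
  frequency: number;  // assumed normalized to 0..1
  tagBoost: number;   // small additive bonus
  typeBoost: number;  // small additive bonus
}

interface BlendWeights {
  similarity: number;
  recency: number;
  importance: number;
  frequency: number;
}

// Example weights chosen only for this sketch.
const exampleWeights: BlendWeights = {
  similarity: 0.6,
  recency: 0.2,
  importance: 0.15,
  frequency: 0.05,
};

// Weighted sum of the four main signals, plus the two boosts added as-is.
function blendScore(c: ScoreComponents, w: BlendWeights = exampleWeights): number {
  const weighted =
    w.similarity * c.similarity +
    w.recency * c.recency +
    w.importance * c.importance +
    w.frequency * c.frequency;
  return weighted + c.tagBoost + c.typeBoost;
}
```

Logging the components next to the final score is a quick way to see why one memory outranked another, for example a stale but highly similar record losing to a fresher one.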

Worked example: a fleet-monitoring agent

A long-running monitoring agent ingests telemetry/alerts, periodically summarizes incidents to keep the hot tier small, shares learned remediations across the fleet, and builds a focused context window when asked to diagnose.

Run the full version with:

npm run build
node --import tsx examples/fleet-monitoring-agent.ts

import { MemoryManager, AgentMemoryWrapper, InProcessSyncBus, createLogger } from '@forge/agent-memory-manager';

const bus = new InProcessSyncBus();                  // shared across the fleet
const logger = createLogger({ level: 'info', name: 'fleet' });

function makeAgent(nodeId: string, agentId: string) {
  const manager = new MemoryManager({
    namespace: 'fleet-monitoring',
    nodeId,
    sync: bus,
    logger,
    tiers: { hot: { capacity: 16 }, warm: { capacity: 256 } },
  });
  return new AgentMemoryWrapper({ agentId, manager });
}

const east = makeAgent('monitor-east', 'agent-A');
const west = makeAgent('monitor-west', 'agent-B');
await east.init();
await west.init();

// 1) Ingest a telemetry/alert stream into hierarchical memory.
const incident = 'incident-7731';
await east.remember('ALERT: db-1 connection pool saturated, 0 free connections',
  { importance: 0.9, type: 'alert', threadId: incident, tags: ['ops'] });
await east.remember('web-3 returning HTTP 503 for /checkout',
  { importance: 0.85, type: 'alert', threadId: incident, tags: ['ops'] });
await east.remember('db-1 slow query log shows full table scan on orders',
  { importance: 0.7, type: 'observation', threadId: incident, tags: ['ops'] });

// 2) Recall the most relevant memories for an investigation.
const recalled = await east.recall('why is checkout returning 503 errors', 3);

// 3) Compress the incident thread into a summary (keeps hot memory lean).
const summary = await east.summarizeThread(incident, { maxChars: 240 });

// 4) Share a learned remediation with the whole fleet.
await east.shareWithFleet(
  'Remediation: when db-1 pool saturates, raise max_connections and kill full-table-scan queries on orders.',
  { key: 'remediation:db-1-pool', importance: 0.95, tags: ['runbook'] },
);

// 5) agent-B (west) now sees the shared runbook via fleet sync.
const onWest = await west.manager.get('remediation:db-1-pool');

// 6) Build a bounded context window for a downstream task.
const ctx = await east.buildContext('diagnose the db-1 / checkout outage and recommend a fix', 600);
console.log(ctx.context);

// 7) Observe tier sizes + metrics.
console.log(await east.stats());

This pattern keeps an always-on agent bounded: high-signal alerts live in the hot tier, routine telemetry decays to warm/cold and is eventually evicted, and periodic summaries preserve the gist of old incidents without retaining every raw line.

Storage: in-memory vs Postgres/pgvector

The store is a small interface (MemoryStore). Two implementations ship:

import { InMemoryStore, createPgliteMemoryStore, PgVectorStore } from '@forge/agent-memory-manager';

// (a) default, no dependencies, not durable
const memStore = new InMemoryStore();

// (b) embedded Postgres + pgvector (PGlite, WASM) — durable if you pass a dataDir
const { store: pgStore } = await createPgliteMemoryStore({ dimensions: 256, dataDir: './memdb' });

// (c) external Postgres server via node-postgres — identical SQL/pgvector code path
import pg from 'pg';
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
const extStore = new PgVectorStore(pool, { dimensions: 256 });
await extStore.init();

const memory = new MemoryManager({ store: pgStore /* or extStore */, namespace: 'agent-1' });

PgVectorStore targets any executor exposing query(sql, params) => { rows }, which both pg.Pool and PGlite satisfy — so the embedded engine used in tests runs the same pgvector SQL (embedding <=> $1::vector) as a Postgres server.

The embedding dimension passed to the store must match your embedding
provider's dimensions (default 256).
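
The executor contract and the dimension requirement can be illustrated with a short sketch. Only the `query(sql, params) => { rows }` shape, the `<=>` operator, and the `$1::vector` cast come from the text above; the `memories` table, its `key` and `embedding` columns, and the helper names are hypothetical and are not the schema PgVectorStore actually uses.

```typescript
// Minimal executor contract: anything with query(sql, params) => { rows }.
interface QueryExecutor {
  query(sql: string, params?: unknown[]): Promise<{ rows: Record<string, unknown>[] }>;
}

// pgvector accepts a vector as a bracketed text literal, e.g. '[0.1,0.2]'.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

// Fail fast when the embedding provider and the store disagree on dimensions.
function assertDimensions(embedding: number[], expected: number): void {
  if (embedding.length !== expected) {
    throw new Error(`embedding has ${embedding.length} dimensions, store expects ${expected}`);
  }
}

// Nearest-neighbour lookup by cosine distance (`<=>`), closest first.
// Table and column names are assumptions made for this sketch.
async function nearestKeys(
  db: QueryExecutor,
  embedding: number[],
  limit: number,
  dimensions = 256,
): Promise<string[]> {
  assertDimensions(embedding, dimensions);
  const sql =
    "SELECT key, embedding <=> $1::vector AS distance " +
    "FROM memories ORDER BY distance ASC LIMIT $2";
  const { rows } = await db.query(sql, [toVectorLiteral(embedding), limit]);
  return rows.map((row) => String(row.key));
}
```

Because the function depends only on the `QueryExecutor` shape, the same code can be handed a pg.Pool, a PGlite instance, or a stub in unit tests.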

Policies

All policies are pure functions in src/policies.ts and are unit-tested.

Time-based decay: effectiveImportance = importance * 0.5^(age / tierHalfLife). Each tier has its own half-life (hot: 1h, warm: 24h, cold: 30d by default). When a memory's effective importance falls below a tier's floor, it is demoted.

Relevance scoring: a weighted blend of cosine similarity, recency, importance, and access frequency, plus small tag and type boosts. Weights are configurable via relevanceWeights.

Eviction: a retention score blends LRU (recency of last access), decayed importance, and frequency. On capacity overflow, the lowest-retention records are demoted to the next tier; cold-tier overflow is deleted (evicted).
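
The decay formula and the overflow rule above can be sketched as pure functions. The decay function follows the stated formula and default half-lives; everything else is an assumption for illustration: the `Rec` shape, the 0.4 / 0.4 / 0.2 retention weights, and the way recency and frequency are normalized are hypothetical and may differ from src/policies.ts.

```typescript
type Tier = "hot" | "warm" | "cold";

// Default half-lives from the text: hot 1h, warm 24h, cold 30d.
const HALF_LIFE_MS: Record<Tier, number> = {
  hot: 60 * 60_000,
  warm: 24 * 60 * 60_000,
  cold: 30 * 24 * 60 * 60_000,
};

// effectiveImportance = importance * 0.5^(age / tierHalfLife)
function effectiveImportance(importance: number, ageMs: number, tier: Tier): number {
  return importance * Math.pow(0.5, ageMs / HALF_LIFE_MS[tier]);
}

// Hypothetical record shape; timestamps are epoch milliseconds.
interface Rec {
  key: string;
  importance: number;
  createdAt: number;
  lastAccessedAt: number;
  accessCount: number;
}

// Assumed retention blend: LRU recency + decayed importance + access frequency.
function retentionScore(r: Rec, tier: Tier, now: number): number {
  const lru = Math.pow(0.5, (now - r.lastAccessedAt) / HALF_LIFE_MS[tier]);
  const decayed = effectiveImportance(r.importance, now - r.createdAt, tier);
  const frequency = 1 - 1 / (1 + r.accessCount); // saturates toward 1
  return 0.4 * lru + 0.4 * decayed + 0.2 * frequency;
}

// On capacity overflow, pick the lowest-retention records that must leave the tier.
function selectOverflow(records: Rec[], capacity: number, tier: Tier, now: number): Rec[] {
  const excess = records.length - capacity;
  if (excess <= 0) return [];
  return [...records]
    .sort((a, b) => retentionScore(a, tier, now) - retentionScore(b, tier, now))
    .slice(0, excess);
}
```

With the default hot half-life, a memory stored at importance 0.8 has an effective importance of 0.4 after one hour and 0.2 after two.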

Everything is configurable per manager:

new MemoryManager({
  tiers: { hot: { capacity: 64, halfLifeMs: 30 * 60_000, minImportance: 0.7 } },
  relevanceWeights: { similarity: 0.6, recency: 0.2, importance: 0.15 },
  promoteThreshold: 0.6, // cosine at/above which a retrieved memory is pulled to hot
});

Hybrid retrieval & reranking

Retrieval is a two-stage pipeline (default retrievalMode: 'hybrid'):

First stage, recall: a dense semantic search (vector store) and a BM25 lexical search (Bm25Index) run in parallel and are combined with Reciprocal Rank Fusion (reciprocalRankFusion). Dense search captures meaning; BM25 captures exact rare tokens (hostnames, error codes, ticket IDs) that embeddings tend to blur. Because fusion works on ranks rather than raw scores, it does not need the two scorers to share a score scale.

Second stage, rerank (optional): a cross-encoder reads each (query, document) pair jointly and reorders the fused head. This is the largest precision lever in the pipeline, but it is applied only to the top rerankTopK candidates because it runs a model once per pair.

import { MemoryManager, LocalCrossEncoderReranker } from '@forge/agent-memory-manager';

const memory = new MemoryManager({
  retrievalMode: 'hybrid',                       // 'hybrid' (default) | 'dense'
  reranker: new LocalCrossEncoderReranker(),     // optional MiniLM cross-encoder (transformers.js)
  rerankTopK: 50,                                // candidates handed to the reranker
});

// Per-call overrides:
await memory.retrieve('TICKET-9981', 5);                       // hybrid + rerank
await memory.retrieve('disk pressure', 5, { retrievalMode: 'dense' }); // dense only
await memory.retrieve('disk pressure', 5, { rerank: false });          // skip reranker

The BM25 index is maintained incrementally inside the manager (store-agnostic), so hybrid retrieval works with any MemoryStore backend. The dense-vs-hybrid and hybrid-vs-rerank lift is measured on the official BEIR / SciFact benchmark — see bench/beir-scifact.mjs and the officialBenchmark block of the verification report.

The cross-encoder needs the optional @xenova/transformers dependency and a one-time model download (a disclosed seam). Without a reranker configured, the manager runs hybrid first-stage retrieval only, which is fully local and needs no download.
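
For readers unfamiliar with Reciprocal Rank Fusion, the first-stage merge described in this section can be sketched in a few lines. This is a generic illustration of the technique, not the package's reciprocalRankFusion export, whose signature is not shown here; the `fuseRankings` name and the constant k = 60 (a commonly used value) are assumptions.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank is 1-based and documents missing from a ranking contribute nothing.
function fuseRankings(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: ids ranked by a dense search and by a lexical (BM25) search.
const denseRanking = ["m1", "m2", "m3"];
const lexicalRanking = ["m3", "m1", "m4"];
const fused = fuseRankings([denseRanking, lexicalRanking]);
console.log(fused.map((f) => f.id)); // [ 'm1', 'm3', 'm2', 'm4' ]
```

Only rank positions enter the formula, so a cosine similarity and a BM25 score never have to be normalized against each other. Documents that appear in both lists (m1 and m3 here) rise above documents found by only one retriever.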

Fleet sync

For distributed/fleet use, pass a SyncBus. Memories written with shareScope: 'fleet' are replicated to peer managers on the same bus and namespace; 'local' memories stay put. The built-in InProcessSyncBus (EventEmitter) is what the tests use; implement the same SyncBus interface over Redis/NATS/Kafka for a real fleet (a disclosed seam — not exercised here).

const bus = new InProcessSyncBus();
const a = new MemoryManager({ namespace: 'fleet', nodeId: 'a', sync: bus });
const b = new MemoryManager({ namespace: 'fleet', nodeId: 'b', sync: bus });
await a.store('runbook:1', 'shared knowledge', { shareScope: 'fleet' });
// After the event propagates, b.get('runbook:1') resolves to the replicated memory.

Conflicts are resolved last-writer-wins by updatedAt, and a node ignores its own events, so there is no echo loop.
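
The merge rule can be sketched as a single function applied to each incoming event. The `SyncEvent` and `StoredValue` shapes and the `applyRemote` name are hypothetical, and keeping the local copy on an exact timestamp tie is an assumption made for this sketch; the real SyncBus payload may carry more fields.

```typescript
// Hypothetical replication event; updatedAt is epoch milliseconds.
interface SyncEvent {
  originNodeId: string;
  key: string;
  value: string;
  updatedAt: number;
}

interface StoredValue {
  value: string;
  updatedAt: number;
}

// Apply a remote event with last-writer-wins semantics.
// Returns true when the local copy was created or overwritten.
function applyRemote(
  local: Map<string, StoredValue>,
  selfNodeId: string,
  event: SyncEvent,
): boolean {
  // A node ignores its own events, which prevents an echo loop.
  if (event.originNodeId === selfNodeId) return false;

  // Keep the local copy when it is at least as new as the incoming one.
  const current = local.get(event.key);
  if (current !== undefined && current.updatedAt >= event.updatedAt) return false;

  local.set(event.key, { value: event.value, updatedAt: event.updatedAt });
  return true;
}
```

Last-writer-wins depends on node clocks being reasonably in sync; with skewed clocks, a write that happened later in real time can lose to an earlier one that carries a larger updatedAt.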

Observability

Logs: createLogger({ level, name }) emits structured JSON lines to stderr. Inject a sink to route elsewhere; use silentLogger in tests.

Metrics: manager.metrics.snapshot() returns counters (memory_store_total, memory_retrieve_total, memory_evict_total, memory_demote_total, memory_promote_total, memory_replicated_total, …), gauges (memory_tier_size{tier}), and latency histograms (memory_store_ms, memory_retrieve_ms, memory_summarize_ms). manager.stats() adds live per-tier sizes. Swap in Prometheus/OTEL via the Metrics interface.

Extensibility

Every collaborator is an interface with a sensible default:

Interface	Default	Swap for
`MemoryStore`	`InMemoryStore`	`PgVectorStore` (PGlite or external Postgres), your own backend
`EmbeddingProvider`	`HashingEmbeddingProvider` (local, lexical)	`LocalTransformerEmbeddingProvider` (MiniLM), `RemoteEmbeddingProvider` (OpenAI/Voyage/Cohere/…)
`Reranker`	none (hybrid first-stage only)	`LocalCrossEncoderReranker` (MiniLM cross-encoder), or any hosted reranker
`Summarizer`	`ExtractiveSummarizer` (local)	`ClaudeSummarizer` (Anthropic) or any LLM
`SyncBus`	`InProcessSyncBus`	Redis/NATS/Kafka adapter
`Logger` / `Metrics`	JSON logger / in-memory metrics	pino/Prometheus/OTEL adapters

Example: hosted embeddings + Claude summarization.

import { MemoryManager, RemoteEmbeddingProvider, ClaudeSummarizer } from '@forge/agent-memory-manager';

const embeddings = new RemoteEmbeddingProvider({
  dimensions: 1536,
  embedFn: async (texts) => callYourEmbeddingApi(texts), // returns number[][]
});
const summarizer = new ClaudeSummarizer({ model: 'claude-3-5-sonnet-latest' }); // uses ANTHROPIC_API_KEY

const memory = new MemoryManager({ embeddings, summarizer /* + a matching-dimension store */ });

Verification & proof

node verify.mjs    # builds, runs unit tests, runs the synthetic retrieval
                   # benchmark over BOTH the in-memory store and real pgvector,
                   # checks decay/eviction/sync/metrics, writes verification-report.*

This outcome carries a full Forge proof package under proof/ (evidence, claim lineage, checksums, Evidence Grade, Trust Score, auditor challenge, and the authoritative gate decision). Start with proof/EXECUTIVE_EVIDENCE.md and proof/REPRODUCE.md.