Memory Benchmark — Getting Started

Status: beta. The memory benchmark drives the agent under test through multi-turn, multi-scenario memory stress tests via the MemoryBenchmarkRunner and grades responses with the per-scenario MemoryJudge. It is a memory-architecture diagnostic, not a published academic baseline (see the sibling longmemeval family for that).

Coverage: 5 presets — quick (3 categories, CI-friendly), standard (8 categories incl. Abstention + Preference Extraction), full (12 categories incl. cross-session, conflict resolution, multi-session reasoning), diagnostic (same as Full with maximum context pressure ~50K+ tokens), and overflow (Standard categories with 192K target on a 128K window — POWER-USER, deliberately stresses context-window saturation).

What this measures

The memory benchmark stresses agent memory architecture across retention, temporal-reasoning, noise-resilience, reach-back-depth, fact-update-handling, multi-topic-juggling, abstention (hallucination detection), preference-extraction, cross-session, reducer-fidelity, conflict-resolution, and multi-session-reasoning dimensions. Each preset configures a weighted set of categories; the MemoryBenchmarkRunner injects synthetic conversation history, asks targeted recall questions, and grades responses via the LLM MemoryJudge.

What IS tested: the agent's end-to-end ability to recall, update, abstain, and reason across multi-turn / multi-session conversation state — under varying levels of noise and context pressure. What is NOT tested: the agent's internal architecture choices in isolation (reducer impl, vector-store backend, embedding model) — those are inferred from end-to-end behaviour, not directly inspected. For paper-comparable single-session memory baselines, see the sibling longmemeval family.

Scope and omissions

  • Covered (with rationale per item):
    • 12 distinct memory dimensions across the Full preset — each targets a specific failure mode of memory-equipped agents.
    • Configurable context pressure (TargetTokensOverride) — lets diagnostic / overflow presets stress agents past nominal context limits.
    • Configurable overflow filler turns (OverflowCallsOverride) — for graduated context-saturation tests beyond raw injection.
    • Per-category weights — composite score reflects the operator's relative concern across dimensions.
    • LLM-graded responses via MemoryJudge — semantic grading, not regex/exact-match.
    • Abstention category — explicitly grades the agent's willingness to say "I don't know" rather than hallucinate.
  • Out of scope (with rationale):
    • Memory architecture introspection (reducer impl, vector store, embedding model) — black-box evaluation only.
    • Cross-agent comparison without baseline calibration — different agents stress different categories differently; calibrate per-agent before tracking deltas.
    • Code-grader fallback (when the LLM judge fails, the category falls through with a judge-failure marker; there is no syntactic / regex fallback).
    • Multi-language scenarios — English-only.
    • Single-session paper-comparable baselines — use the longmemeval family for that.
    • Pure-tool-call memory testing (e.g. "remember to call this tool with this arg") — the scenarios target conversational recall, not tool-state.

Presets

Sourced verbatim from BenchmarkFamilyRegistry (see src/AgentEval.Memory/MemoryBenchmarkRegistration.cs:43-47).

| Preset | Description (verbatim) | Cost tier | Typical entry count | Approx. LLM cost |
| --- | --- | --- | --- | --- |
| quick | 3 categories (basic retention, temporal, noise) — CI-friendly | Medium | ~3 categories x ~5 scenarios each = ~15 scenario invocations | ~$0.20 - $0.80 against gpt-4o-mini |
| standard | 8 categories including Abstention + Preference Extraction | High | ~8 categories x ~5-10 scenarios = ~40-80 invocations | ~$1.00 - $4.00 |
| full | 12 categories including cross-session, conflict resolution, multi-session reasoning | High | ~12 categories x ~5-10 scenarios = ~60-120 invocations | ~$2.00 - $8.00 |
| diagnostic | Same categories as Full with maximum context pressure (~50K+ tokens) | High | Full's 12 categories with extended context (POWER-USER preset) | ~$5.00 - $20.00 (context pressure inflates judge prompt sizes) |
| overflow | 8 categories with context overflow (192K target on 128K window) | High | Standard's 8 categories with deliberate context-window saturation (POWER-USER preset) | ~$5.00 - $20.00 (context overflow + extended interactions per scenario) |

Cost estimates assume gpt-4o-mini judge pricing and depend heavily on the agent's response length + the chosen context-pressure target. Diagnostic and overflow presets are POWER-USER — they stress the agent's reducer / summarisation / vector-store path past nominal limits and are designed to surface failure modes that the standard preset masks; expect notably higher cost.

Why is quick CostTier.Medium if it's CI-friendly?

quick makes ~15 LLM round-trips (~$0.20 - $0.80 at gpt-4o-mini pricing): small in absolute terms, but well above the CostTier.Low budget used by bench owasp smoke (zero LLM cost) or bench perf latency (telemetry-only). It is CI-tractable when the CI budget allows ~$1/run; if the budget is tighter, consider running quick on the main branch and nightly rather than on every commit. The other 4 presets are High and not intended for any commit-time CI.
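
One way to apply the "main branch and nightly" suggestion is a small gate in the pipeline script. The sketch below is illustrative only: the function name, the argument order, and the trigger values "push" and "schedule" are assumptions that need mapping onto whatever your CI system actually exposes.

```shell
# Hypothetical CI gate: decide whether this pipeline run should pay for `quick`.
# $1 = branch name, $2 = trigger kind ("push" or "schedule" are assumed values).
should_run_quick() {
  branch="$1"
  trigger="$2"
  # Nightly (scheduled) runs always qualify.
  if [ "$trigger" = "schedule" ]; then
    return 0
  fi
  # Otherwise only pushes to main qualify.
  [ "$branch" = "main" ] && [ "$trigger" = "push" ]
}

# Example wiring (not executed here):
#   if should_run_quick "$BRANCH" "$TRIGGER"; then
#     agenteval bench memory --preset quick --subject MyAgent
#   fi
```

Keeping the decision in one function makes it easy to tighten or relax the policy as the budget changes.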

CLI usage

# Quick (CI-friendly, ~$0.20-$0.80)
agenteval bench memory --preset quick --subject MyAgent

# Standard
agenteval bench memory --preset standard --subject MyAgent

# Full (12 categories)
agenteval bench memory --preset full --subject MyAgent

# Diagnostic (POWER-USER — max context pressure)
agenteval bench memory --preset diagnostic --subject MyAgent

# Overflow (POWER-USER — context-window saturation on 128K models)
agenteval bench memory --preset overflow --subject MyAgent

REQUIRES Azure OpenAI — no stub fallback. All three of AZURE_OPENAI_ENDPOINT + AZURE_OPENAI_API_KEY + AZURE_OPENAI_DEPLOYMENT must be set; the benchmark needs a real LLM-backed agent under test plus a real LLM judge for grading.
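
Since there is no stub fallback, it can be worth failing fast before the benchmark starts. The following preflight check is a minimal sketch: the three variable names are the ones required above, while the function name require_azure_env is hypothetical.

```shell
# Preflight check: fail if any required Azure OpenAI variable is unset or empty.
# The function name is hypothetical; the variable names come from the requirement above.
require_azure_env() {
  missing=0
  for name in AZURE_OPENAI_ENDPOINT AZURE_OPENAI_API_KEY AZURE_OPENAI_DEPLOYMENT; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example wiring (not executed here):
#   require_azure_env && agenteval bench memory --preset quick --subject MyAgent
```

The check reports every missing variable in one pass rather than stopping at the first, which saves a round of trial and error when configuring a new CI job.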

The CLI runs the agent-under-test via chatClient.AsEvaluableAgent(name: subject, includeHistory: true) against the resolved Azure deployment. The system prompt is fixed to "You are a helpful assistant. Use what you remember from our conversation to answer." Programmatic callers can compose a different IEvaluableAgent and pass it to MemoryBenchmarkRunner.Create(chatClient).RunBenchmarkAsync(agent, preset) directly.

The overflow preset is designed for ~128K-context models (such as gpt-4o-mini) where setting TargetTokensOverride = 128_000 + OverflowCallsOverride = 20 deliberately fills 75% of the window via injection, then pushes past the limit via filler calls. On larger-context models the overflow effect attenuates; treat the result as model-specific rather than absolute.

Output

Each run writes to the canonical run dir under .agenteval/subjects/agents/{subject}/runs/{runId}/:

  • report-native.json — the native MemoryBenchmarkResult (Shape B, ADR-017 Convention 3): per-category scores, overall score, grade, total duration.
  • The canonical manifest.json / summary.json carry the run-level audit-chain metadata (run ID, content hash, timestamp, verdict, metrics).
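
For scripting against this layout, a helper that finds the newest native report for a subject can be handy. This is a sketch under assumptions: the function name is hypothetical, and "newest" is decided by file modification time because the run-ID format is not relied on here.

```shell
# Print the path of the most recently modified report-native.json for a subject.
# $1 = subject name, $2 = workspace root (defaults to the current directory).
# Assumption: modification time is a good enough proxy for run order.
latest_native_report() {
  subject="$1"
  root="${2:-.}"
  runs_dir="$root/.agenteval/subjects/agents/$subject/runs"
  # ls -t lists newest first; keep only the first match, if any.
  newest=$(ls -t "$runs_dir"/*/report-native.json 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "no runs found for subject '$subject'" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

Paths containing newlines would confuse the ls-based listing; for ordinary subject names and run IDs that should not matter.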

Memory is a Shape B family (per ADR-017): its multi-scenario, multi-turn, agent-stateful semantics do not map onto the single-shot Convention-2 (EvalInput) → EvalResult shape that OWASP / MITRE / Perf use, so no report.json / report.md / report.html / report.pdf sidecars are emitted. Mission Control renders the native shape directly.

CLI verdict mapping (aligned with MemoryBenchmarkResult.Passed canonical semantics at src/AgentEval.Memory/Models/MemoryBenchmarkResult.cs:75):

  • overall_score >= 70 → PASS
  • 50 <= overall_score < 70 → WARN
  • overall_score < 50 → FAIL
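
The thresholds above can be mirrored in a few lines when post-processing scores outside the CLI. This is a sketch of the mapping only, not the CLI's own code; the function name verdict_for_score is hypothetical.

```shell
# Map an overall score (0-100, decimals allowed) to the CLI verdict.
# Thresholds are the ones listed above; the function name is hypothetical.
verdict_for_score() {
  awk -v s="$1" 'BEGIN {
    if (s + 0 >= 70)      print "PASS"
    else if (s + 0 >= 50) print "WARN"
    else                  print "FAIL"
  }'
}
```

awk is used rather than shell arithmetic so that fractional scores such as 69.9 compare correctly.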

Interpreting results

The native MemoryBenchmarkResult carries:

  • OverallScore — weighted aggregate across the preset's category set (0-100 scale).
  • Grade — letter grade derived from OverallScore.
  • CategoryResults[] — per-category score, name, scenario-level breakdown.
  • Per-scenario judge verdicts + raw transcripts (programmatically accessible via the runner).
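
To build intuition for how a weighted aggregate behaves, the sketch below computes a weighted mean from "weight score" pairs. Normalising by the sum of the weights is an assumption made for illustration; the runner's exact formula is not confirmed here, and the function name is hypothetical.

```shell
# Weighted mean of category scores. Reads "weight score" pairs, one per line.
# Assumption: the aggregate is sum(weight * score) / sum(weight).
weighted_score() {
  awk '{ total += $1 * $2; weights += $1 }
       END {
         if (weights == 0) { print "no weights" > "/dev/stderr"; exit 1 }
         printf "%.1f\n", total / weights
       }'
}
```

For example, a category weighted 2 that scores 80 and a category weighted 1 that scores 50 give (2 x 80 + 1 x 50) / 3 = 70.0, so the heavier category pulls the aggregate toward its own score.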

Per-category interpretation:

| Score band | Meaning |
| --- | --- |
| >= 90 | Excellent: strong memory across the dimension |
| 70 - 89 | Pass: production-acceptable for most use cases |
| 50 - 69 | Warning: degradation visible; investigate before shipping memory-critical features |
| < 50 | Fail: memory architecture is unable to handle the dimension at the preset's pressure level |

CLI exit codes: PASS / WARN → exit 0 (pass and warn are treated as non-failing for CI), FAIL → exit 2.
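
In a pipeline script, the exit status can be translated into an explicit decision. The sketch below covers the two documented codes; treating any other status as an infrastructure error is an assumption, and the function name is hypothetical.

```shell
# Translate the bench command's exit status into a CI decision.
# 0 (PASS or WARN) and 2 (FAIL) are the documented codes; treating anything
# else as an infrastructure error is an assumption of this sketch.
classify_bench_exit() {
  case "$1" in
    0) echo "non-failing" ;;
    2) echo "fail" ;;
    *) echo "error" ;;
  esac
}

# Example wiring (not executed here). The "|| status=$?" form keeps a script
# running under "set -e" from aborting before the status is inspected:
#   status=0
#   agenteval bench memory --preset quick --subject MyAgent || status=$?
#   outcome=$(classify_bench_exit "$status")
```

Because WARN also exits 0, a pipeline that wants to surface warnings has to read the score from the run output rather than rely on the exit status alone.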

How to act on findings

  • Low Basic Retention — the agent isn't holding even single-session state; verify the chat-client adapter has includeHistory: true and the history isn't being aggressively truncated.
  • Low Temporal Reasoning — the agent isn't tracking relative time / ordering; review whether timestamps reach the model and whether the system prompt acknowledges temporal awareness.
  • Low Noise Resilience — irrelevant turns are confusing the agent; the reducer / summarisation strategy may be over-dropping relevant facts, or the model may be over-attending to recency.
  • Low Reach-Back Depth — the agent can't reach back far enough; if you have a reducer, its summarisation may be compressing too aggressively.
  • Low Fact Update Handling — the agent sticks with the original fact when it should pick up the update; classic stale-cache symptom in summarisation pipelines.
  • Low Abstention — the agent hallucinates rather than saying "I don't know"; tighten the system prompt's abstention policy and consider an upstream uncertainty-detection gate.
  • Low Cross-Session — multi-session state isn't crossing session boundaries; verify the agent implements ISessionResettableAgent and the session-boundary semantics match your production model.
  • Low Reducer Fidelity — the reducer is losing information during compression; this is the canonical reducer-tuning signal.
  • Low Conflict Resolution — when facts conflict, the agent picks arbitrarily rather than reasoning about precedence; review whether the agent has explicit guidance on conflict-resolution policy.
  • Low Multi-Session Reasoning — the agent can recall across sessions but can't reason across the recalled facts; this typically points at a reducer that's too lossy for downstream synthesis.
  • Low Preference Extraction — implicit user preferences expressed across the conversation aren't being internalised; preference extraction is often a separate sub-system; check whether it's wired in.
  • Low Multi-Topic — context-switching is causing forgetting of earlier-topic facts; review summarisation policy across topic boundaries.

When to use this benchmark

  • You ship a memory-capable agent (reducer, vector store, summarisation) and need diagnostic coverage across the failure modes that production memory architectures hit.
  • You are tuning reducer / context-window / summarisation strategy and want to detect regressions across 12 dimensions.
  • You need to detect hallucination regressions (the Abstention category explicitly grades the agent's willingness to say "I don't know").
  • You want a CI-friendly fast feedback loop on basic retention + temporal + noise dimensions (use quick).
  • You need POWER-USER context-overflow stress testing on a 128K-window model (use overflow).
  • You need maximum-pressure diagnostic coverage to surface failure modes the standard preset masks (use diagnostic).

When NOT to use:

  • For paper-comparable single-session memory baselines — use the sibling longmemeval family (academic ICLR 2025 benchmark).
  • For pure-tool-call memory (tool-arg persistence) — the scenarios target conversational recall, not tool-state.
  • For non-English memory tests — English-only.
  • When you need a stub / dry-run path — there is none. The LLM round-trip IS the signal.
  • For cross-agent comparison without per-agent baseline calibration — different agents stress different categories differently.

Programmatic use

The CLI is the supported path for canonical run persistence, but the underlying MemoryBenchmark presets + MemoryBenchmarkRunner are public and usable from C# directly. Minimal example:

using AgentEval.Benchmarks;
using AgentEval.Memory;
using AgentEval.Memory.Evaluators;
using Microsoft.Extensions.AI;

// Provide your own IChatClient (Azure OpenAI, OpenAI direct, etc).
IChatClient chatClient = /* your IChatClient */;

// Pick a preset: Quick / Standard / Full / Diagnostic / Overflow.
var preset = MemoryBenchmark.Standard;

// Build the runner (one runner instance per chat client is fine).
var runner = MemoryBenchmarkRunner.Create(chatClient);

// Build the agent under test.
var agent = chatClient.AsEvaluableAgent(
    name: "MyAgent",
    systemPrompt: "You are a helpful assistant. Use what you remember from our conversation to answer.",
    includeHistory: true);

var result = await runner.RunBenchmarkAsync(agent, preset);
Console.WriteLine($"Overall score: {result.OverallScore:F1}% — Grade: {result.Grade}");
foreach (var cat in result.CategoryResults)
    Console.WriteLine($"  {cat.CategoryName,-30} {cat.Score:F1}%");

The runner accepts any IEvaluableAgent — the chatClient.AsEvaluableAgent(...) extension is one convenience binding; for an agent with a custom reducer / vector store, implement IEvaluableAgent (and optionally ISessionResettableAgent for the cross-session categories) yourself.
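
The per-category lines printed by the example above can be triaged from the shell by applying the score bands from "Interpreting results". This sketch assumes the console format of that example (category name, then a score ending in "%") and a "." decimal separator; the function name band_categories is hypothetical.

```shell
# Read "<category name> <score>%" lines on stdin and prefix each with its band.
# Lines whose last field is not a percentage (e.g. the overall line) are skipped.
band_categories() {
  awk '$NF ~ /^[0-9]+(\.[0-9]+)?%$/ {
    score = $NF
    sub(/%$/, "", score)          # strip the trailing percent sign
    $NF = ""                      # drop the score field, leaving the name
    name = $0
    gsub(/^ +| +$/, "", name)     # trim padding around the name
    if (score + 0 >= 90)      band = "Excellent"
    else if (score + 0 >= 70) band = "Pass"
    else if (score + 0 >= 50) band = "Warning"
    else                      band = "Fail"
    printf "%s\t%s\t%s\n", band, name, score
  }'
}
```

Piping the program's output through band_categories and then through grep for Warning or Fail gives a short list of dimensions to investigate first.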

Comparing across runs / baselines

Memory runs are stored canonically under .agenteval/subjects/agents/{subject}/runs/{runId}/. Compare runs via:

  • git diff on report-native.json — surfaces per-category score changes plus overall score + grade deltas.
  • Mission Control — renders the native shape; visual diff across runs.
  • Programmatic post-processing of MemoryBenchmarkResult.CategoryResults for per-category tracking outside AgentEval.

The AgentEval.Memory assembly also ships a BaselineComparer + JsonFileBaselineStore for per-agent baseline persistence and regression detection — see src/AgentEval.Memory/Reporting/ for the surface (not yet exposed via CLI for the memory family specifically).

Limitations and roadmap

Known limitations:

  • LLM-judge cost dominates; diagnostic + overflow presets are notably more expensive than Standard. Budget accordingly.
  • English-only scenarios.
  • Memory architecture introspection (reducer impl, vector store, embedding model) is inferred from end-to-end behaviour, not directly inspected.
  • The overflow preset's saturation effect attenuates on larger-context models; treat results as model-specific.
  • No code-grader fallback — judge-failure entries fall through with a judge-failure marker.
  • CLI verdict thresholds (70 PASS / 50 WARN) are aligned with the canonical MemoryBenchmarkResult.Passed boundary at MemoryBenchmarkResult.cs:75; pre-v1.1 CLI used 80/50 which made canonical Passed=true scores of 75 render as WARN.

Tracking backlog (see strategy/FutureFeatures/todo/13-pending-issues-tasks.md):

  • T0.6 — agenteval bench memory CLI command (shipped 2026-05-24).
  • T3.13 — Multi-turn calibration entry schema extension (open; would re-enable carved-out memory evaluators in the agentic calibration sweep).
  • T3.11 — Multi-provider agent-manifest schema (would let the memory benchmark target non-Azure agents directly).
  • Per-category drill-down rendering in Mission Control remains roadmap.

See also:

  • LongMemEval getting-started — sister memory family targeting paper-comparable academic baselines.
  • Agentic getting-started — broader agentic-quality benchmarks (including 5 memory-multiturn evaluators in the agentic dispatch table).
  • src/AgentEval.Memory/Models/MemoryBenchmark.cs — preset factory source (Quick / Standard / Full / Diagnostic / Overflow).
  • src/AgentEval.Memory/Evaluators/MemoryBenchmarkRunner.cs — runner source.
  • src/AgentEval.Memory/Engine/MemoryJudge.cs — LLM judge source.
  • src/AgentEval.Cli/Commands/BenchMemoryCommand.cs — CLI subcommand source.