ADR-004: Trace Recording and Replay for Time-Travelling Debugging

Status: Accepted
Date: January 7, 2026
Authors: AgentEval Team
Supersedes: None
Related: ADR-002 (Result Directory Structure)

Context

AgentEval needs deterministic, fast, and cost-free test execution for CI/CD pipelines. Currently, every test run requires live LLM API calls, which are:

Expensive — API costs accumulate with every test run
Slow — Network latency adds 500ms-5s per request
Non-deterministic — Same prompt can yield different responses
Unreliable — Depends on external service availability

Additionally, debugging agent failures is difficult because execution context is ephemeral — once a test fails, the exact tool calls, responses, and timing are lost.

Research Findings

We evaluated existing patterns and standards:

Pattern	Language	Approach	Suitability
VCR (Ruby)	Ruby	HTTP cassettes (YAML/JSON)	HTTP-level only, not agent-aware
Azure SDK Test Framework	.NET	Test Proxy with Record/Playback	Mature but complex, HTTP-focused
scotch / Betamax.NET	.NET	VCR ports	HTTP-focused, not maintained
OpenTelemetry Traces	Multi	Spans with attributes	Observability, not replay
Promptfoo caching	Node	Response cache	Simple cache, not full trace
LangSmith	Python	Tracing + evaluation	SaaS-focused, Python-only

Key Finding: No existing standard captures agent-specific concerns:

Tool call semantics (name, arguments, results, ordering)
Multi-turn conversation state
Streaming token chunks and TTFT
Performance metrics (latency, cost)

Decision

1. Create AgentEval-Specific Trace Format

We will define our own JSON-based trace format rather than adopt an existing standard.

Rationale:

Agent-specific concerns are not covered by HTTP-level patterns
We can design for our exact use cases
Enables tool call-aware matching and validation
Can export to OpenTelemetry for observability if needed
Positions AgentEval as a potential industry standard

2. Trace Recording via Wrapping Pattern

TraceRecorder wraps IEvaluableAgent and intercepts all calls, recording them to an AgentTrace object.

┌─────────────────────────────────────────────────────────────────────────┐
│                         RECORDING FLOW                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Test Code           TraceRecordingAgent          Actual Agent         │
│       │                      │                          │               │
│       │──InvokeAsync(prompt)─▶│                          │               │
│       │                      │──Record Request──────────▶│               │
│       │                      │                          │               │
│       │                      │◀─────────Response─────────│               │
│       │                      │──Record Response─────────▶│               │
│       │◀─────Response────────│                          │               │
│       │                      │                          │               │
│                                                                         │
│   On disposal: Save AgentTrace to .trace.json                           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Why wrapping, not middleware:

Works with any IEvaluableAgent implementation
No changes required to test code
Transparent to the evaluation harness
Consistent with existing adapter patterns (ChatClientAgentAdapter)

3. Trace Replay via Mock Agent

TraceReplayingAgent implements IEvaluableAgent and returns recorded responses.

┌─────────────────────────────────────────────────────────────────────────┐
│                          REPLAY FLOW                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Test Code         TraceReplayingAgent            [No API Calls]       │
│       │                      │                                          │
│       │──InvokeAsync(prompt)─▶│                                          │
│       │                      │──Find matching entry──▶                  │
│       │                      │                                          │
│       │◀─────Recorded Response│                                          │
│       │                      │                                          │
│                                                                         │
│   Execution: Instantaneous, deterministic, zero cost                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Request matching strategy: Index-based (simplest, deterministic)

Entry N in trace matches call N in test
Future: Optional content-based matching for flexibility

4. NOT a New Exporter — Separate Concern

TraceRecorder is NOT added to IResultExporter.

Rationale:

IResultExporter exports evaluation results after execution
TraceRecorder captures execution traces during execution
Different lifecycles, different data models
Mixing concerns would complicate both

Architecture:

┌─────────────────────────────────────────────────────────────────────────┐
│                        SEPARATION OF CONCERNS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  DURING EXECUTION                    AFTER EXECUTION                    │
│  ─────────────────                   ─────────────────                  │
│  TraceRecorder                       IResultExporter                    │
│  ├─ Wraps IEvaluableAgent             ├─ JsonExporter                    │
│  ├─ Records requests/responses       ├─ JUnitXmlExporter                │
│  └─ Outputs: .trace.json             ├─ MarkdownExporter                │
│                                      └─ TrxExporter                     │
│                                                                         │
│  Data: AgentTrace                    Data: EvaluationReport             │
│  Purpose: Replay & Debug             Purpose: CI/CD & Reporting         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

5. Trace File Format

JSON with versioned schema:

{
  "version": "1.0",
  "traceName": "weather_test",
  "capturedAt": "2026-01-07T10:30:00Z",
  "agentName": "WeatherAgent",
  "entries": [
    {
      "type": "Request",
      "index": 0,
      "timestamp": "2026-01-07T10:30:00.000Z",
      "prompt": "What's the weather in Paris?"
    },
    {
      "type": "Response",
      "index": 0,
      "timestamp": "2026-01-07T10:30:00.500Z",
      "text": "The weather in Paris is 22°C and sunny.",
      "durationMs": 500,
      "toolCalls": [...],
      "tokenUsage": { "prompt": 50, "completion": 30 }
    }
  ],
  "performance": {
    "totalDurationMs": 500,
    "totalPromptTokens": 50,
    "totalCompletionTokens": 30
  }
}

Schema hosted at: https://agenteval.dev/schemas/trace-v1.json (future)

6. CLI Integration

# Record mode
agenteval test --record --output ./traces

# Replay mode
agenteval test --replay ./traces

# View a trace
agenteval trace show ./traces/weather_test.trace.json

Consequences

Positive

100x faster tests — Replay is instantaneous (no network)
Zero API cost — Recorded traces can run indefinitely
Deterministic — Same trace = same result every time
Debuggable — Traces can be inspected, edited, shared
CI/CD friendly — Record nightly, replay on every PR
Future-proof — Versioned format allows evolution

Negative

Storage overhead — Traces are stored in repo or separate location
Drift risk — Agent changes may invalidate old traces
Maintenance — Traces must be re-recorded when prompts change
Complexity — Two modes to understand and manage

Mitigations

Storage: Compress traces; store in separate assets repo (Azure SDK pattern)
Drift: Provide --force-record and automated drift detection
Maintenance: CI job to detect trace staleness
Complexity: Good defaults; record-once semantics

Implementation

Phase 1: Core Classes ✅ IMPLEMENTED

AgentTrace — Data model for recorded traces (AgentTrace.cs)
TraceEntry — Individual request/response entry
TraceRecordingAgent — Wrapper that records (TraceRecordingAgent.cs)
TraceReplayingAgent — Mock that replays (TraceReplayingAgent.cs)
TraceSerializer — JSON serialization (TraceSerializer.cs)
ChatTraceRecorder — Multi-turn conversation recording (ChatTraceRecorder.cs)
WorkflowTraceRecorder — Multi-agent workflow recording (WorkflowTraceRecorder.cs)
WorkflowTraceReplayingAgent — Workflow replay (WorkflowTraceReplayingAgent.cs)
Tests: 56+ tests across 3 test files

Phase 2: Integration (Pending)

❌ Evaluation harness integration (EvaluationOptions.TraceRecorder)
❌ CLI commands (agenteval test --record/--replay)

Phase 3: Advanced Features (Partially Implemented)

✅ Streaming chunk recording/replay (implemented in TraceRecordingAgent.InvokeStreamingAsync)
✅ Request matching modes (Any/Exact/Contains in TraceReplayOptions)
❌ Content-based matching (semantic similarity)
❌ Trace diff for debugging
❌ VS Code extension integration

Alternatives Considered

A. Use Azure SDK Test Proxy

Rejected: Too heavy, HTTP-focused, requires external process.

B. Use OpenTelemetry Format

Rejected: Designed for observability, not replay. Could add as parallel export later.

C. Add to IResultExporter

Rejected: Wrong abstraction — exporters run after tests, traces capture during.

D. HTTP-Level Recording (VCR Pattern)

Rejected: Doesn't capture tool call semantics, multi-turn state, or agent-level concerns.

References

VCR (Ruby) — Original cassette pattern
Azure SDK Test Framework — .NET gold standard
OpenTelemetry Traces — Observability standard
ADR-002: Result Directory Structure — Related file organization decisions

Accepted: January 7, 2026

Table of Contents