ADR-019: Chat-Boundary Tracing and the Two-Layer Recording Model
Status
Accepted — delivered as Glass Box v0.11 on branch feat/glass-box-v0.11-chat-boundary-tracing. Builds on ADR-004 (trace record/replay) and the modularization of ADR-016.
Date
2026-05-31
Context
AgentEval recorded only what the agent framework reported: TraceRecordingAgent wraps an IEvaluableAgent and captures one request/response per invocation — exactly what the agent boundary chooses to surface. When a MAF agent runs an internal plan-act-observe loop that hits the model N times, the recorder sees one top-level call and whatever subset of the loop the framework exposes. Retried turns, intermediate tool failures, injected system prompts, per-turn finish reasons, and provider verdicts are invisible by design. For black-box behavioural tests this is fine; for audit-grade compliance evidence and for detecting framework-level honesty violations it is not.
We need a second capture layer one level down — at the Microsoft.Extensions.AI.IChatClient boundary — where every LLM round-trip is visible verbatim before any agent layer summarises, retries, or filters it. MEAI 10.5.0 ships the required base types (DelegatingChatClient with virtual GetResponseAsync/GetStreamingResponseAsync; ChatClientBuilder.Use), confirmed by reflection, so no hand-rolled shim is needed.
Decision
Add
TraceRecordingChatClient : DelegatingChatClientinAgentEval.Core(AgentEval.Tracing). It records oneTraceEntryper LLM round-trip (Scope = ChatTurn), capturing the verbatim system prompt, tool definitions, per-turn text, finish reason, request options, provider metadata, token usage, and latency. Placing it in Core (MEAI-only, no MAF dependency) makes the capability reach anyIChatClient— Semantic Kernel, custom orchestration, raw loops — not just MAF.Composition rule (load-bearing): because
ChatClientBuildermakes the first.Use(...)the outermost layer andFunctionInvokingChatClient(FICC) calls its inner client once per round-trip,UseTraceRecordingmust be composed inner ofUseFunctionInvocationto observe every round-trip. Composing it outer records one entry for the whole loop.Tool-execution seam: add
EvaluatingAIFunction : AIFunctioninAgentEval.MAFto capture whatIChatClientmiddleware structurally cannot — the function execution (timing, args, result, exception) — as aToolExecution-scoped entry. It calls the inner function's publicInvokeAsync(not theprotectedInvokeCoreAsync, which CS1540 forbids through a base-typed reference).Correlation: a caller-established
ToolCorrelationScope(AsyncLocal, in Core so both the Core recorder and the MAF tool wrapper can read it) stamps every entry of one invocation with a sharedCorrelationId.Indexremains the request/response pairing key;CorrelationIdis the grouping key.The two recorders are complementary, not competing.
ChatTraceRecorder(conversation replay, one entry per user turn) is unchanged;TraceRecordingChatClientanswers "what did the model see on every call."agenteval doctorwarns when both wrap the same agent (double-wrapping).Trace Fidelity (a Shape-B benchmark family on
BenchmarkFamilyRegistry, per ADR-017) reconciles the two layers and flags missing/phantom calls, hidden retries, argument drift, token under-reporting, and suppressed finish reasons — auditing the framework's honesty.
Consequences
Positive
- White-box, audit-grade evidence: every round-trip, tool schema, finish reason, and provider verdict is observable and hash-anchorable.
- Framework-agnostic by construction (recorder is MEAI-only in Core).
- A structurally novel capability (Trace Fidelity) that no two-layer-less competitor can offer, plus an upstream-bug-report loop to
microsoft/agent-framework. - A runtime policy gate (guardrails) reuses the same
IChatClientseam.
Negative / costs
- Trace volume grows (~N× for an N-turn loop); mitigated by tool-definition de-dup in Smoke/Standard presets (off in AuditGrade).
- Composition order is load-bearing and easy to get wrong; mitigated by docs, the
doctorcheck, and a pinned composition test. - Per-turn↔tool correlation is best-effort (per-invocation grouping + tool
CallId); precise under sequential tools, documented caveat under parallel tools.
Alternatives Considered
- Agent-boundary only (status quo). Rejected: cannot surface hidden retries, suppressed finish reasons, or per-turn evidence.
- A hand-rolled
IChatClientdecorator shim. Rejected: MEAI 10.5.0 already shipsDelegatingChatClient/ChatClientBuilder.Use(verified by reflection), so the shim is unnecessary. - Recorder in
AgentEval.MAF. Rejected: would tie chat-boundary capture to MAF and forfeit SK/raw-loop reach; Core placement keeps it universal. - Subclassing
TraceEntryfor chat-turn/tool entries. Rejected: the trace serializes a flatList<TraceEntry>with reflection-based STJ; subclasses would break it. Additive nullable fields + static factory methods are used instead (see ADR-020).
Invariants (do not "fix" these)
These are load-bearing and easy to mistake for accidents:
- Two
AgentTracetypes are intentional.AgentEval.Tracing.AgentTrace(the working class used by recorders/replay) is distinct fromAgentEval.Output.AgentTrace(the IOutputStore record). They are not duplication — do not merge them. - The recorder must be composed inner of
UseFunctionInvocation.ChatClientBuilder.Useapplies first-registered-outermost, so the recorder sits inside the tool loop and captures one entry per model round-trip; placed outer it records a single entry for the whole loop. Pinned byTraceRecordingChatClientTests.ComposedInnerOfFunctionInvocation*and checked by thedoctorcommand. For MAF agents the runtime adds the loop, so wrapping the raw model withUseTraceRecordingalready lands the recorder inside it. - Agent-boundary reconstruction shares the recorder's serialization.
AgentBoundaryTraceBuilder.FromAgentResponse(...)rebuilds the framework's self-report (from MAFAgentResponse.Messages/.Usage/.FinishReason) for a Trace Fidelity reconciliation. It and the chat recorder both map tool calls viaTraceMapping.ToToolCallso arguments serialize byte-identically — otherwise a clean run would show a fabricatedargument_drift. Empirically (pinned byAgentBoundaryTraceBuilderTests), MEAI's function-invocation loop preserves every tool call in the final messages and sums token usage, so on a normal run the framework agrees on totals: it hides per-turn detail, it does not lie. TheReal vs Frameworkobservability samples demonstrate exactly this.