ADR-014: Dataset Pipeline — Two-Model Architecture
Status: Accepted
Date: 2026-02-24
Decision Makers: AgentEval Contributors
Related Document: strategy/AgentEval-dataloader-Implementation-Review-and-Refinement.md (Conflicts C, D)
Context
AgentEval's dataset-driven evaluation pipeline involves two distinct models that represent "a test case" at different abstraction layers:
| Model | Location | Purpose |
|---|---|---|
| `DatasetTestCase` | `src/AgentEval/DataLoaders/IDatasetLoader.cs` | Persistence model — loaded from `.jsonl`, `.json`, `.csv`, `.yaml` files; tolerates format aliases; is format-agnostic |
| `TestCase` | `src/AgentEval/Models/EvaluationModels.cs` | Execution model — consumed by `IEvaluationHarness`, `StochasticRunner`, assertions; has strict typed requirements |
The Problem
Documentation, samples, and user-facing guides had started treating the two models as synonymous. Specifically:
1. `docs/getting-started.md` showed a YAML dataset file using `name:`, `expected_output_contains:`, and `evaluation_criteria:` as field names — all of which are `TestCase` properties, not `DatasetTestCase` properties. Loading such a file via `IDatasetLoader` silently drops those fields into `Metadata`.
2. No official bridge existed between `DatasetTestCase` and `TestCase`. Users had to write manual mapping code, which inevitably lost information (most critically: the `GroundTruthToolCall` → `string` projection had no guidance).
3. `DatasetTestCase` could not carry `evaluation_criteria` — a value authors legitimately want to specify in their dataset files for use by the LLM judge.
Forces
- CLEAN Architecture (enforced in this project): persistence concerns must not bleed into domain models.
- DRY: every consumer of dataset files was writing the same mapping boilerplate.
- Type safety: `DatasetTestCase.GroundTruth` is `GroundTruthToolCall` (structured, BFCL-style function-call accuracy), while `TestCase.GroundTruth` is `string` (free text for LLM-as-judge). These are semantically different despite sharing a name.
- Documentation trust: the primary getting-started guide must show fields that actually work.
Decision
Keep both models separate. DatasetTestCase is the persistence/input layer; TestCase is the domain/execution layer. The boundary between them is made explicit and well-supported.
Specific Changes
1. Extend DatasetTestCase with EvaluationCriteria
Add the following properties to DatasetTestCase, recognized from the corresponding fields in all four loaders (JSONL, JSON, CSV, YAML):
- `IReadOnlyList<string>? EvaluationCriteria` — from `evaluation_criteria`
- `IReadOnlyList<string>? Tags` — from `tags` (maps to `TestCase.Tags`)
- `int? PassingScore` — from `passing_score` (maps to `TestCase.PassingScore`, which is `int` and defaults to `EvaluationDefaults.DefaultPassingScore` if null)
Rationale: EvaluationCriteria is the primary TestCase property that makes sense as a file-level specification. Without it, dataset authors cannot specify evaluation criteria without writing a custom ToTestCase() override for every project. Tags enables test filtering by category directly from dataset files. PassingScore allows per-test threshold overrides without code changes.
When adding new recognized fields, JsonParsingHelper.KnownPropertyNames must also be updated so these fields are not duplicated into Metadata for JSON/JSONL loaders. Additionally, the YAML loader’s private YamlTestCase DTO class requires corresponding properties and mapping in ConvertToDatasetTestCase().
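As a shape reference, the additions could look like the sketch below. This is not the actual AgentEval source — existing members are elided, and the property shapes are inferred from this ADR (`get`/`set` mutability matches the deserialization requirement discussed under Alternative D).

```csharp
using System.Collections.Generic;

// Sketch of the extended persistence model, not the real class in
// src/AgentEval/DataLoaders/IDatasetLoader.cs.
public class DatasetTestCase
{
    // ...existing properties (Id, Input, ExpectedOutput, ExpectedTools,
    // GroundTruth, Metadata) elided...

    // From "evaluation_criteria" in all four loaders; flows to
    // TestCase.EvaluationCriteria via ToTestCase().
    public IReadOnlyList<string>? EvaluationCriteria { get; set; }

    // From "tags"; maps to TestCase.Tags for category-based filtering.
    public IReadOnlyList<string>? Tags { get; set; }

    // From "passing_score"; null falls back to
    // EvaluationDefaults.DefaultPassingScore during conversion.
    public int? PassingScore { get; set; }
}
```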
2. Provide DatasetTestCaseExtensions.ToTestCase() as the official bridge
```csharp
public static TestCase ToTestCase(
    this DatasetTestCase d,
    Func<GroundTruthToolCall?, string?>? groundTruthProjection = null) => new()
{
    Name = string.IsNullOrEmpty(d.Id) ? d.Input[..Math.Min(50, d.Input.Length)] : d.Id,
    Input = d.Input,
    ExpectedOutputContains = d.ExpectedOutput,
    EvaluationCriteria = d.EvaluationCriteria,
    ExpectedTools = d.ExpectedTools,
    GroundTruth = groundTruthProjection != null
        ? groundTruthProjection(d.GroundTruth)
        : (d.GroundTruth is null ? null : JsonSerializer.Serialize(d.GroundTruth)),
    Tags = d.Tags,
    PassingScore = d.PassingScore ?? EvaluationDefaults.DefaultPassingScore, // int? → int
    // Filter null values: DatasetTestCase.Metadata is Dictionary<string, object?>
    // but TestCase.Metadata is IDictionary<string, object> (non-nullable values).
    Metadata = d.Metadata.Count > 0
        ? d.Metadata
            .Where(kv => kv.Value is not null)
            .ToDictionary(kv => kv.Key, kv => kv.Value!)
        : null,
};
```
3. GroundTruth projection default: JSON-serialize the structured value
When a DatasetTestCase has a GroundTruthToolCall and no custom projection is provided, ToTestCase() serializes it to a JSON string (e.g., `{"name":"book_flight","arguments":{"city":"Paris"}}`).
Rationale: Using the function name only (d.GroundTruth?.Name) silently discards argument data that the LLM judge could use to verify whether the agent called the tool with the correct parameters. JSON serialization preserves complete information and is human-readable by the judge.
Users who want name-only can pass `gt => gt?.Name`:

```csharp
var testCase = datasetCase.ToTestCase(groundTruthProjection: gt => gt?.Name);
```
4. RunBatchAsync accepts IEnumerable<DatasetTestCase> directly
The batch evaluation API accepts DatasetTestCase and performs the ToTestCase() conversion internally, so callers using RunBatchAsync never need to manually bridge the models in the common case. Custom groundTruthProjection can be provided as an option if needed.
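In practice the call site could look like the sketch below. Only `RunBatchAsync`, `DatasetTestCase`, and the `groundTruthProjection` option come from this ADR; the `loader`/`harness` variables and the `LoadAsync` method name are illustrative assumptions, not the verified AgentEval API.

```csharp
// Illustrative only — member names other than RunBatchAsync are assumed.
IEnumerable<DatasetTestCase> cases = await loader.LoadAsync("cases.jsonl");

// Common case: the harness converts via ToTestCase() internally.
var summary = await harness.RunBatchAsync(cases);

// Optional: override the GroundTruth projection (here, function name only).
var nameOnly = await harness.RunBatchAsync(
    cases,
    groundTruthProjection: gt => gt?.Name);
```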
5. Fix documentation
docs/getting-started.md YAML example corrected:
- `name:` → `id:`
- `expected_output_contains:` → `expected:`
- `evaluation_criteria:` stays valid (now recognized by all loaders per change 1)
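For reference, a corrected dataset file might look like the sketch below. The `id:`, `expected:`, `evaluation_criteria:`, `tags:`, and `passing_score:` names come from changes 1 and 5; the root-level sequence layout and the `input:` key are assumptions about the loader's YAML shape, not confirmed by this ADR.

```yaml
# Sketch — exact YAML layout (root sequence, input key) is assumed.
- id: book-flight-paris
  input: "Book me a flight to Paris next Friday"
  expected: "Paris"
  evaluation_criteria:
    - "Confirms the destination city"
    - "Calls the booking tool with correct arguments"
  tags: [booking, smoke]
  passing_score: 80
```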
Consequences
Positive
- CLEAN boundary maintained: `DatasetTestCase` remains a persistence concern; `TestCase` remains a domain concern. Code depending on `TestCase` does not need to know about file formats.
- No boilerplate for users: `ToTestCase()` covers the common case in one line; `RunBatchAsync` handles it transparently.
- GroundTruth information preserved by default (JSON serialization), with an escape hatch for customization.
- Documentation is now correct: the YAML example uses fields that `IDatasetLoader` actually recognizes.
- `evaluation_criteria` round-trip: dataset authors can specify evaluation criteria in YAML/JSON/JSONL/CSV and have it flow through to the harness without any code.
- `Tags` and `PassingScore` round-trip: dataset authors can also specify `tags` for test filtering and `passing_score` for per-test threshold overrides.
Important Note: RunBatchAsync Return Type
`RunBatchAsync` returns `TestSummary`, which has `TotalCount`/`PassedCount` properties. Documentation must NOT reference `.TotalTests`/`.PassedTests` — those are properties of `EvaluationReport` (the exporter-layer type in `Exporters/EvaluationReport.cs`), which is a different type.
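A documentation snippet following this rule might read as below; the `harness` and `cases` variables are illustrative, while the property names come from this note.

```csharp
// TestSummary is what RunBatchAsync returns.
TestSummary summary = await harness.RunBatchAsync(cases);
Console.WriteLine($"{summary.PassedCount}/{summary.TotalCount} tests passed");

// EvaluationReport (Exporters/EvaluationReport.cs) is the exporter-layer type;
// its TotalTests/PassedTests properties do not exist on TestSummary.
```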
Negative / Trade-offs
- Extra type to learn: new users see two models and must understand the boundary. This is documented clearly in `getting-started.md` and `comparison.md`.
- `GroundTruthToolCall` → `string` serialization is a one-way projection; you cannot recover the structured value from `TestCase.GroundTruth` alone. This is acceptable because `TestCase.GroundTruth` is consumed by the LLM judge as text.
- Small breaking potential: any existing code that relied on `evaluation_criteria`, `tags`, or `passing_score` silently landing in `Metadata` will now find the values in dedicated `DatasetTestCase` properties instead. This is the correct behavior, but callers accessing `Metadata["evaluation_criteria"]` directly will break.
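For affected callers, the migration is a one-line swap (sketch only; `d` stands for a loaded `DatasetTestCase`):

```csharp
// Before: the unrecognized field fell through into the metadata bag.
// var criteria = d.Metadata["evaluation_criteria"];  // key no longer present

// After: read the dedicated typed property added by this ADR.
IReadOnlyList<string>? criteria = d.EvaluationCriteria;
```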
Alternatives Considered
A — Merge the models (single TestCase)
Make TestCase file-loadable: add snake_case aliases for all properties, accept question / prompt as Input, etc.
Rejected because:
- Persistence concerns (alias mapping, root-key detection) do not belong in the domain model.
- `TestCase.Name` is `required` and has strict domain semantics (used in test output titles). Making it optional with a fallback violates that contract.
- Violates the CLEAN architecture principle already established in ADR-006.
B — Discriminated union TestCaseSource
Introduce TestCaseSource { FromFile(DatasetTestCase), Inline(TestCase) } and have the harness accept this union.
Rejected because:
- Adds a third type for what is fundamentally a two-step pipeline concern.
- Does not resolve the `GroundTruth` type mismatch — the projection still needs to happen somewhere.
- Unnecessary complexity given that `ToTestCase()` solves the problem without a new type.
C — Keep models separate but require users to bridge manually (status quo)
Rejected because:
- Documentation's `getting-started.md` example already proves users get it wrong (it used `TestCase` field names for a `DatasetTestCase` YAML file).
- Every project writing the same `new TestCase { Name = d.Id, ... }` mapping is the exact boilerplate a framework should eliminate.
D — Shared base class or interface ITestCaseBase
Evaluated and rejected because:
- `GroundTruth` has incompatible types across the two models (`string?` vs `GroundTruthToolCall?`). A shared interface cannot define this property.
- Mutability contracts differ: `TestCase` uses `init` setters (compile-time immutability); `DatasetTestCase` uses `get`/`set` (required for progressive deserialization from CSV/JSON/YAML).
- Property names intentionally diverge: `Name` (required) vs `Id` (optional), `ExpectedOutputContains` vs `ExpectedOutput`. These are different semantics, not aliases.
- `Metadata` nullability differs: `IDictionary<string, object>?` vs `Dictionary<string, object?>`. Neither contract can be weakened without downstream impact.
- Only two properties (`Input`, `ExpectedTools`) actually share both name and type — too thin for a useful abstraction.
- Existing benchmark test case types (`ToolAccuracyTestCase`, `TaskCompletionTestCase`, `MultiStepTestCase`) already don't share a base. Introducing inheritance only here would be architecturally inconsistent.
- Extensibility for custom dataset formats is already covered by `IDatasetLoader` (the interface) and `Metadata` (the bag). Subclassing `DatasetTestCase` adds no value `IDatasetLoader` doesn't already provide.
Implementation
See strategy/AgentEval-dataloader-Implementation-Review-and-Refinement.md for the full implementation plan, specifically:
- M3 — `DatasetTestCase` extended model + `ToTestCase()` adapter
- M4 — `RunBatchAsync` on `MAFEvaluationHarness` accepting `DatasetTestCase`
- M1 — Documentation corrections