Composite Evaluations

Overview

A composite evaluation aggregates multiple sub-evals into one scored result with a recursive tree of sub-results. This is the right primitive when a single pass/fail verdict must draw on several independent checks — for example, a GDPR Article 17 rollup across acknowledgment, backup propagation, legal obligation, and over-erasure checks; Foundry's tool-call-accuracy formula (0.25 × selection + 0.25 × input + ...); or a multi-judge consensus where each judge is one component.

Five aggregation strategies ship under AgentEval.Evals.Aggregations:

| Strategy | When to use |
| --- | --- |
| WeightedSumAggregation | Default: score is Σ(weight_i × score_i) over components. Used by most agentic and GDPR Standard presets. |
| MinAggregation | Score is the minimum across components, so any weak component caps the verdict. Used inside EU AI Act Pillar 1 (Prohibited Practices). |
| CapByWorstAggregation | Score is capped at the worst severity-weighted sub-score; surfaces critical failures even when other components score well. Used by GDPR / EU AI Act audit presets. |
| WeightedMedianAggregation | Outlier-resistant alternative to weighted sum; useful when one judge in a multi-judge panel is known to skew scores. |
| MajorityVoteAggregation | Pass/fail verdict by majority vote across components; used for stochastic-runs aggregation in agenteval bench gdpr --runs N. |
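
To make the differences between strategies concrete, the sketch below recomputes four of them over one set of component scores. It is not the AgentEval implementation: the `Component` record, the `AggregationSketch` class, the tie-breaking rules, and the sample numbers are all assumptions made for illustration. CapByWorstAggregation is left out because its severity weighting is not specified here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: three judges, one of which skews low.
var components = new List<Component>
{
    new("judge_a", Weight: 0.4, Score: 0.90, Passed: true),
    new("judge_b", Weight: 0.4, Score: 0.80, Passed: true),
    new("judge_c", Weight: 0.2, Score: 0.10, Passed: false),
};

Console.WriteLine($"weighted sum    = {AggregationSketch.WeightedSum(components):F2}");
Console.WriteLine($"min             = {AggregationSketch.Min(components):F2}");
Console.WriteLine($"weighted median = {AggregationSketch.WeightedMedian(components):F2}");
Console.WriteLine($"majority vote   = {AggregationSketch.MajorityVote(components)}");

// Hypothetical stand-in for a scored component; not an AgentEval type.
public record Component(string Key, double Weight, double Score, bool Passed);

public static class AggregationSketch
{
    // Sum of weight_i * score_i; assumes the weights already sum to 1.0.
    public static double WeightedSum(IEnumerable<Component> cs) =>
        cs.Sum(c => c.Weight * c.Score);

    // The weakest component decides the score.
    public static double Min(IEnumerable<Component> cs) =>
        cs.Min(c => c.Score);

    // Lower weighted median: walk the scores in ascending order and return the
    // first one at which the cumulative weight reaches half of the total weight.
    public static double WeightedMedian(IEnumerable<Component> cs)
    {
        var sorted = cs.OrderBy(c => c.Score).ToList();
        double half = sorted.Sum(c => c.Weight) / 2.0;
        double cumulative = 0.0;
        foreach (var c in sorted)
        {
            cumulative += c.Weight;
            if (cumulative >= half) return c.Score;
        }
        throw new InvalidOperationException("At least one component is required.");
    }

    // Passes only when strictly more than half of the components pass (ties fail).
    public static bool MajorityVote(IEnumerable<Component> cs)
    {
        var list = cs.ToList();
        return list.Count(c => c.Passed) * 2 > list.Count;
    }
}
```

With these numbers the weighted sum is 0.70, the minimum is 0.10, and the weighted median is 0.80: the single low judge drags the sum down and dominates the minimum, but leaves the median untouched.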

Three kinds of evals

All three implement IEval and produce the same EvalResult shape. Callers never need to branch on type.

| Kind | Class | How it scores |
| --- | --- | --- |
| Atomic (LLM judge) | AtomicLlmEval | Delegates to an AgentEval.Core.IEvaluator; normalises the 0–100 score to 0–1. |
| Atomic (deterministic) | AtomicCodeEval (abstract) | Subclass implements Evaluate(input) synchronously; no LLM call. |
| Composite | CompositeEval | Runs all sub-evals in parallel via Task.WhenAll; aggregates via IAggregationStrategy. |

All three live in the AgentEval.Evals namespace.


Quick start — atomic LLM-judge eval

AtomicLlmEval wraps an existing AgentEval.Core.IEvaluator instance (obtain one from your evaluator factory or DI container).

var ack = new AtomicLlmEval(
    evaluator: myEvaluator,                   // AgentEval.Core.IEvaluator
    key:       "art17_acknowledgment",
    name:      "Article 17 - Acknowledgment",
    category:  "compliance.gdpr",
    version:   "1.0.0",
    criteria:  new[] { "Acknowledgment is explicit and timely" },
    passThreshold: 0.70);                     // default; omit to keep 0.70

var result = await ack.EvaluateAsync(new EvalInput(
    Query:    "I want my data deleted.",
    Response: "We confirm receipt of your request..."));

Console.WriteLine(result.Score.Value);   // 0..1
Console.WriteLine(result.Score.Label);  // "pass" or "fail"

Quick start — composite eval

The example below mirrors the GDPR Article 17 worked example from the internal design notes.

// 1. Build four atomic sub-evals (using stub implementations here for brevity).
var acknowledgment   = new Art17AcknowledgmentEval(myEvaluator);   // AtomicLlmEval subclass
var backupPropagation = new Art17BackupPropagationEval(myEvaluator);
var legalObligation  = new Art17LegalObligationEval(myEvaluator);
var noOvererasure    = new Art17NoOvererasureEval(myEvaluator);

// 2. Wrap them in a CompositeEval with weights that sum to 1.0 and a pass threshold.
var components = new EvalComponent[]
{
    new(acknowledgment,    Weight: 0.30),
    new(backupPropagation, Weight: 0.30),
    new(legalObligation,   Weight: 0.20),
    new(noOvererasure,     Weight: 0.20),
};

var article17 = new CompositeEval(
    key:        "gdpr_article_17",
    name:       "Article 17 - Right to erasure",
    category:   "compliance.gdpr",
    version:    "1.0.0",
    components: components,
    aggregation: WeightedSumAggregation.Instance,
    threshold:   0.80);

// 3. Evaluate.
var input  = new EvalInput(Query: "Please delete all my personal data.");
var result = await article17.EvaluateAsync(input);

// 4. Inspect the result.
Console.WriteLine(result.Score.Value);            // weighted sum, e.g. 0.675
Console.WriteLine(result.Score.Severity);         // max severity of sub-evals, e.g. "high"
Console.WriteLine(result.Score.Label);            // "pass" or "fail" (threshold-driven)

foreach (var sub in result.Details.SubResults!)
{
    Console.WriteLine($"  {sub.Metric.Key}: {sub.Score.Value:F2} ({sub.Score.Label})");
}
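
As a worked example of how the weighted sum in step 4 could arrive at 0.675, assume hypothetical sub-scores of 0.90, 0.60, 0.50, and 0.625 for the four components, in the order they are declared. These values are illustrative, not measured output.

```latex
\begin{aligned}
\text{score} &= \sum_i w_i \, s_i \\
             &= 0.30 \times 0.90 + 0.30 \times 0.60 + 0.20 \times 0.50 + 0.20 \times 0.625 \\
             &= 0.27 + 0.18 + 0.10 + 0.125 \\
             &= 0.675
\end{aligned}
```

Because 0.675 is below the 0.80 threshold passed to the composite, the label for this input would be "fail".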

Composites can nest. Because CompositeEval implements IEval, it can itself appear as a component inside another CompositeEval. The recursive tree of sub-results is preserved all the way down.
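
The recursion behind nested composites can be sketched without the library. The example below is an illustration only: `Node` and `TreeSketch` are hypothetical types, every level is assumed to use a weighted sum, and the keys and scores are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical tree: an outer composite holding one leaf and one nested composite.
var tree = new Node("gdpr", 1.0, Children: new List<Node>
{
    new("article_15", 0.5, LeafScore: 0.25),
    new("article_17", 0.5, Children: new List<Node>
    {
        new("acknowledgment", 0.5, LeafScore: 1.00),
        new("backup_propagation", 0.5, LeafScore: 0.50),
    }),
});

TreeSketch.Print(tree, depth: 0);

// Hypothetical node type; not an AgentEval type.
// A leaf carries LeafScore, a composite carries Children.
public record Node(string Key, double Weight, double LeafScore = 0.0, List<Node>? Children = null);

public static class TreeSketch
{
    // A composite's score is the weighted sum of its children's scores, computed recursively.
    public static double Score(Node n) =>
        n.Children is null ? n.LeafScore : n.Children.Sum(c => c.Weight * Score(c));

    // Depth-first walk that indents each level of the tree by two spaces.
    public static void Print(Node n, int depth)
    {
        Console.WriteLine($"{new string(' ', depth * 2)}{n.Key}: {Score(n):F2}");
        foreach (var child in n.Children ?? new List<Node>())
            Print(child, depth + 1);
    }
}
```

Here the nested article_17 composite scores 0.75, which then enters the outer composite as an ordinary component, giving 0.5 × 0.25 + 0.5 × 0.75 = 0.50 at the root.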


Verdict matrix

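The composite verdict is determined after aggregation. When a threshold is set, the score alone decides the label; when no threshold is set, the rolled-up severity decides. warn is a soft fail: Passed = false, but Label = "warn" distinguishes it from a hard fail.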

| Threshold set? | Condition | Label | Passed |
| --- | --- | --- | --- |
| Yes | score >= threshold | "pass" | true |
| Yes | score < threshold | "fail" | false |
| No | severity is critical or high | "fail" | false |
| No | severity is medium | "warn" | false |
| No | severity is none or low | "pass" | true |

Composite severity is the maximum severity across all sub-results (none < low < medium < high < critical), computed by SeverityRollup.Max.
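
The matrix and the severity rollup can be read as a small decision function. The sketch below is one way to express them, assuming a hypothetical `Severity` enum and `VerdictSketch` class; it does not reproduce the signature of SeverityRollup.Max or any other AgentEval type.

```csharp
using System;
using System.Linq;

// Threshold set: the score alone decides.
Console.WriteLine(VerdictSketch.Decide(0.675, Severity.High, threshold: 0.80));  // (fail, False)

// No threshold: the rolled-up severity decides.
var rolledUp = VerdictSketch.MaxSeverity(Severity.Low, Severity.Medium, Severity.None);
Console.WriteLine(VerdictSketch.Decide(0.90, rolledUp, threshold: null));        // (warn, False)

// Declared in ascending order so that enum comparison matches
// none < low < medium < high < critical.
public enum Severity { None, Low, Medium, High, Critical }

public static class VerdictSketch
{
    // Maximum severity across sub-results; an empty set is treated as None (an assumption).
    public static Severity MaxSeverity(params Severity[] severities) =>
        severities.Length == 0 ? Severity.None : severities.Max();

    // Mirrors the five rows of the verdict matrix.
    public static (string Label, bool Passed) Decide(double score, Severity severity, double? threshold)
    {
        if (threshold is double t)
            return score >= t ? ("pass", true) : ("fail", false);

        return severity switch
        {
            Severity.Critical or Severity.High => ("fail", false),
            Severity.Medium => ("warn", false),
            _ => ("pass", true),
        };
    }
}
```

Note that a high score does not rescue the second call: with no threshold, a medium severity yields "warn" regardless of the numeric score.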


Persistence to the canonical store

Composite results go through EvalResultPersistence to reach the canonical IOutputStore.

// Serialise the recursive tree into a ScenarioResult.
var sr = EvalResultPersistence.ToScenarioResult(
    result,
    scenarioId:   "art17",
    scenarioName: "Article 17 - Right to erasure");

// Write to the store.
await store.WriteScenarioResultAsync(runId, sr);

ToScenarioResult lifts Score.Value, Score.Passed, Details.Dimensions, and Provenance.EstimatedCost to the top-level ScenarioResult fields for queryability. The full recursive tree — including every level of sub-results — is serialised as JSON inside ScenarioResult.Output.

To restore the tree:

var restored = EvalResultPersistence.FromScenarioResult(sr);
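
The idea of storing a recursive tree as one JSON string and restoring it later can be illustrated with System.Text.Json alone. This is a sketch under stated assumptions: `ResultNode` is a hypothetical stand-in for the result tree, and the JSON shape that ToScenarioResult actually writes to ScenarioResult.Output is not shown in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical three-level tree with made-up keys and scores.
var tree = new ResultNode("gdpr_article_17", 0.675, new List<ResultNode>
{
    new("art17_acknowledgment", 0.90, null),
    new("art17_backup_propagation", 0.60, new List<ResultNode>
    {
        new("backup_cold_storage", 0.60, null),
    }),
});

// Serialise the whole tree into a single string, then restore it.
string json = JsonSerializer.Serialize(tree);
ResultNode restored = JsonSerializer.Deserialize<ResultNode>(json)!;

Console.WriteLine(json);
// The deepest level survives the round trip.
Console.WriteLine(restored.SubResults![1].SubResults![0].Key);

// Hypothetical node type; not an AgentEval type.
public record ResultNode(string Key, double Score, List<ResultNode>? SubResults);
```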

The existing ContentHasher.HashRunAsync covers the embedded JSON, so the audit chain (agenteval doctor) extends to composite results without any schema or store changes.


DI registration

services.AddCompositeEvals();

Registers WeightedSumAggregation as the default IAggregationStrategy using TryAdd semantics. If your container already has an IAggregationStrategy registration, it is preserved.
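
The effect of TryAdd semantics can be seen in isolation with the Microsoft.Extensions.DependencyInjection package. In the sketch below, `IStrategy`, `MinStrategy`, and `WeightedSumStrategy` are hypothetical stand-ins for IAggregationStrategy and its implementations, and the singleton lifetime is an assumption; the lifetime used by AddCompositeEvals is not stated here.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

var services = new ServiceCollection();

// The application registers its own strategy first.
services.AddSingleton<IStrategy, MinStrategy>();

// A later TryAdd for the same service type is then a no-op.
services.TryAddSingleton<IStrategy, WeightedSumStrategy>();

using var provider = services.BuildServiceProvider();
Console.WriteLine(provider.GetRequiredService<IStrategy>().GetType().Name);  // MinStrategy

// Hypothetical stand-ins; not AgentEval types.
public interface IStrategy { }
public sealed class MinStrategy : IStrategy { }
public sealed class WeightedSumStrategy : IStrategy { }
```

Order matters with this pattern: to override the default, register your own strategy before calling services.AddCompositeEvals().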


Schema validation

The eval-result.schema.json v1 schema is embedded as a resource in the AgentEval.DataLoaders assembly and covers the recursive EvalResult tree (including nested sub-results). Use the existing SchemaValidator helper (internal) when building tools that need to validate persisted results.


Deferred features

The following are deferred until a concrete consumer asks for them:

  • StrictAggregation — any failure fails the composite (MinAggregation covers most use cases today).
  • MedianAggregation — unweighted median for multi-judge consensus (WeightedMedianAggregation covers the weighted case).
  • Predicate field on EvalComponent — conditional component skipping with automatic weight renormalization.
  • YAML composite authoring — declare composites in config rather than code.
  • Sub-eval result caching — reuse a sub-eval result across multiple composites.
  • Streaming events on sub-eval completion — notify callers as each sub-eval finishes.
  • Hierarchical calibration — per-level score adjustment for nested composites.

See also