ADR-007: Metrics Taxonomy and Categorization

Status

Accepted

Date

2026-01-15

Context

AgentEval has grown to include 14+ metrics across different categories (RAG, Agentic, Embedding, Conversation). As we plan to add more metrics based on our analysis of the Azure AI Evaluation SDK and industry standards, we need a clear taxonomy to:

Help users discover and choose appropriate metrics
Guide implementation decisions for new metrics
Maintain consistency in naming and behavior
Support future extensibility (safety metrics, multimodal, etc.)

Current State

Existing interface hierarchy:

IMetric
├── IRAGMetric (RequiresContext, RequiresGroundTruth)
└── IAgenticMetric (RequiresToolUsage)

Naming prefixes: llm_, code_, embed_

Problem

Flat categorization: All quality metrics are either RAG or Agentic, leaving no room for safety, fluency, or coherence metrics
Missing metadata: No programmatic way to query metric categories, costs, or requirements
Limited discoverability: Users must read documentation to understand what metrics are available

Decision

1. Extend Interface Hierarchy (Optional Interfaces)

Add optional marker interfaces for additional capabilities:

/// <summary>
/// Marker interface for quality evaluation metrics.
/// Quality metrics assess the substantive quality of agent responses.
/// </summary>
public interface IQualityMetric : IMetric { }

/// <summary>
/// Marker interface for safety evaluation metrics.
/// Safety metrics assess potential harms, toxicity, or policy violations.
/// </summary>
public interface ISafetyMetric : IMetric { }

/// <summary>
/// Marker interface for performance evaluation metrics.
/// Performance metrics assess efficiency, latency, and resource usage.
/// </summary>
public interface IPerformanceMetric : IMetric { }
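As a rough illustration of how these marker interfaces might be used for filtering, the sketch below selects safety metrics with LINQ's OfType<T>(). IMetric is reduced to a Name-only stub here, and the three metric classes (and the names llm_toxicity and code_latency) are hypothetical examples, not existing AgentEval types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stub: the real IMetric also declares EvaluateAsync; reduced here for brevity.
public interface IMetric { string Name { get; } }

public interface IQualityMetric : IMetric { }
public interface ISafetyMetric : IMetric { }
public interface IPerformanceMetric : IMetric { }

// Hypothetical metrics, for illustration only.
public sealed class FaithfulnessMetric : IQualityMetric { public string Name => "llm_faithfulness"; }
public sealed class ToxicityMetric : ISafetyMetric { public string Name => "llm_toxicity"; }
public sealed class LatencyMetric : IPerformanceMetric { public string Name => "code_latency"; }

public static class MarkerFilterDemo
{
    public static IReadOnlyList<IMetric> All() => new IMetric[]
    {
        new FaithfulnessMetric(), new ToxicityMetric(), new LatencyMetric()
    };

    // OfType<T> keeps only the elements assignable to T,
    // so the filtered sequence is statically typed as ISafetyMetric.
    public static List<string> SafetyNames(IEnumerable<IMetric> metrics) =>
        metrics.OfType<ISafetyMetric>().Select(m => m.Name).ToList();

    public static void Main()
    {
        // Prints: llm_toxicity
        Console.WriteLine(string.Join(", ", SafetyNames(All())));
    }
}
```

Because a class can implement several interfaces, a metric could carry more than one marker (for example, both IQualityMetric and ISafetyMetric) and would then appear in both filtered views.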

2. Add MetricCategory Enumeration

[Flags]
public enum MetricCategory
{
    None = 0,
    
    // Data requirements
    RequiresContext = 1 << 0,
    RequiresGroundTruth = 1 << 1,
    RequiresToolUsage = 1 << 2,
    RequiresEmbeddings = 1 << 3,
    
    // Evaluation domain
    RAG = 1 << 4,
    Agentic = 1 << 5,
    Conversation = 1 << 6,
    Safety = 1 << 7,
    
    // Quality aspects
    Faithfulness = 1 << 8,
    Relevance = 1 << 9,
    Coherence = 1 << 10,
    Fluency = 1 << 11,
    
    // Computation method
    LLMBased = 1 << 12,
    EmbeddingBased = 1 << 13,
    CodeBased = 1 << 14
}

3. Extend IMetric with Optional Metadata

public interface IMetric
{
    string Name { get; }
    Task<MetricResult> EvaluateAsync(EvaluationContext context);
    
    // Optional metadata (default interface implementations; requires C# 8.0 or later)
    MetricCategory Categories => MetricCategory.None;
    string? Description => null;
    decimal? EstimatedCostPerEvaluation => null;
}
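The following sketch shows how the default members might behave in practice: a metric written before the metadata existed compiles unchanged, while a newer metric overrides what it needs. EvaluationContext and MetricResult are empty stand-ins (the Score property is an assumption), the enum is a subset of the one above, and both metric classes and the description text are hypothetical.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

[Flags]
public enum MetricCategory  // subset of the ADR's enum, same bit positions
{
    None = 0,
    RequiresContext = 1 << 0,
    RAG = 1 << 4,
    LLMBased = 1 << 12
}

// Hypothetical stand-ins for the real AgentEval types.
public sealed class EvaluationContext { }
public sealed class MetricResult { public double Score { get; set; } }

public interface IMetric
{
    string Name { get; }
    Task<MetricResult> EvaluateAsync(EvaluationContext context);

    MetricCategory Categories => MetricCategory.None;
    string? Description => null;
    decimal? EstimatedCostPerEvaluation => null;
}

// Written before the metadata existed: implements only the required members.
public sealed class LegacyMetric : IMetric
{
    public string Name => "code_tool_success";
    public Task<MetricResult> EvaluateAsync(EvaluationContext context) =>
        Task.FromResult(new MetricResult { Score = 1.0 });
}

// Newer metric: public members with matching signatures replace the defaults.
public sealed class FaithfulnessMetric : IMetric
{
    public string Name => "llm_faithfulness";
    public MetricCategory Categories =>
        MetricCategory.RAG | MetricCategory.RequiresContext | MetricCategory.LLMBased;
    public string? Description => "Checks whether the response is supported by the retrieved context.";
    public Task<MetricResult> EvaluateAsync(EvaluationContext context) =>
        Task.FromResult(new MetricResult { Score = 0.0 });
}

public static class DefaultMembersDemo
{
    public static void Main()
    {
        // Default members are reachable only through the interface type.
        IMetric legacy = new LegacyMetric();
        IMetric faithfulness = new FaithfulnessMetric();

        Console.WriteLine(legacy.Categories);        // falls back to the interface default
        Console.WriteLine(faithfulness.Categories);  // uses the class's own value
    }
}
```

One caveat worth keeping in mind: a default member that the class does not redeclare is not part of the class's own surface, so `new LegacyMetric().Categories` does not compile. Callers need an IMetric reference, which is what a registry or DI container would normally hand out anyway.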

4. Metric Naming Convention (Confirmed)

Retain existing prefixes with formal definition:

Prefix	Meaning	Categories Flag	Example
`llm_`	Requires LLM API call	`MetricCategory.LLMBased`	`llm_faithfulness`
`code_`	Computed by code only	`MetricCategory.CodeBased`	`code_tool_success`
`embed_`	Requires embedding API	`MetricCategory.EmbeddingBased`	`embed_answer_similarity`
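One way the prefix-to-flag mapping in the table could be enforced is a small consistency check, for example in a unit test over all registered metrics. The sketch below is an assumption about how that might look; NamingConvention, ExpectedFlag, and IsConsistent are hypothetical names, and the enum is a subset of the one defined above.

```csharp
using System;

[Flags]
public enum MetricCategory  // subset of the ADR's enum, same bit positions
{
    None = 0,
    LLMBased = 1 << 12,
    EmbeddingBased = 1 << 13,
    CodeBased = 1 << 14
}

public static class NamingConvention
{
    // Returns the computation-method flag implied by the name prefix,
    // or null when the name uses none of the three documented prefixes.
    public static MetricCategory? ExpectedFlag(string name)
    {
        if (name.StartsWith("llm_", StringComparison.Ordinal)) return MetricCategory.LLMBased;
        if (name.StartsWith("code_", StringComparison.Ordinal)) return MetricCategory.CodeBased;
        if (name.StartsWith("embed_", StringComparison.Ordinal)) return MetricCategory.EmbeddingBased;
        return null;
    }

    // True when the name has a known prefix and the categories include the matching flag.
    public static bool IsConsistent(string name, MetricCategory categories)
    {
        MetricCategory? expected = ExpectedFlag(name);
        return expected.HasValue && categories.HasFlag(expected.Value);
    }

    public static void Main()
    {
        Console.WriteLine(IsConsistent("llm_faithfulness", MetricCategory.LLMBased));          // True
        Console.WriteLine(IsConsistent("embed_answer_similarity", MetricCategory.CodeBased));  // False
    }
}
```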

5. Metric Discovery Service

public interface IMetricRegistry
{
    IReadOnlyList<IMetric> GetAllMetrics();
    IReadOnlyList<IMetric> GetMetricsByCategory(MetricCategory category);
    IMetric? GetMetricByName(string name);
    void Register(IMetric metric);
}
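A minimal in-memory implementation of this interface might look like the sketch below. Several behaviors are assumptions, since the ADR does not specify them: GetMetricsByCategory returns metrics that carry all of the requested flags, names are unique and case-sensitive, and registering a duplicate name throws. IMetric is reduced to Name and Categories, the enum is a subset, and InMemoryMetricRegistry and SimpleMetric are hypothetical names.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

[Flags]
public enum MetricCategory  // subset of the ADR's enum, same bit positions
{
    None = 0,
    RAG = 1 << 4,
    Agentic = 1 << 5,
    LLMBased = 1 << 12,
    CodeBased = 1 << 14
}

// Reduced IMetric: EvaluateAsync and the other metadata are omitted for brevity.
public interface IMetric
{
    string Name { get; }
    MetricCategory Categories => MetricCategory.None;
}

public interface IMetricRegistry
{
    IReadOnlyList<IMetric> GetAllMetrics();
    IReadOnlyList<IMetric> GetMetricsByCategory(MetricCategory category);
    IMetric? GetMetricByName(string name);
    void Register(IMetric metric);
}

// Not thread-safe; a sketch only.
public sealed class InMemoryMetricRegistry : IMetricRegistry
{
    private readonly Dictionary<string, IMetric> _byName =
        new Dictionary<string, IMetric>(StringComparer.Ordinal);

    public void Register(IMetric metric)
    {
        if (metric is null) throw new ArgumentNullException(nameof(metric));
        if (!_byName.TryAdd(metric.Name, metric))
            throw new InvalidOperationException($"Metric '{metric.Name}' is already registered.");
    }

    public IReadOnlyList<IMetric> GetAllMetrics() => _byName.Values.ToList();

    // Assumption: a metric matches only if it has every requested flag.
    public IReadOnlyList<IMetric> GetMetricsByCategory(MetricCategory category) =>
        _byName.Values.Where(m => (m.Categories & category) == category).ToList();

    public IMetric? GetMetricByName(string name) =>
        _byName.TryGetValue(name, out var metric) ? metric : null;
}

// Hypothetical metric used only to populate the registry.
public sealed class SimpleMetric : IMetric
{
    public SimpleMetric(string name, MetricCategory categories)
    {
        Name = name;
        Categories = categories;
    }

    public string Name { get; }
    public MetricCategory Categories { get; }
}

public static class RegistryDemo
{
    public static void Main()
    {
        var registry = new InMemoryMetricRegistry();
        registry.Register(new SimpleMetric("llm_faithfulness", MetricCategory.RAG | MetricCategory.LLMBased));
        registry.Register(new SimpleMetric("code_tool_success", MetricCategory.Agentic | MetricCategory.CodeBased));

        foreach (var metric in registry.GetMetricsByCategory(MetricCategory.LLMBased))
            Console.WriteLine(metric.Name);  // llm_faithfulness
    }
}
```

Whether a category query should mean "all of these flags" or "any of these flags" is a design choice the interface leaves open. If both are needed, a second method or a match-mode parameter would make the intent explicit.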

Rationale

Why Flags Enum?

A flags enum allows combining multiple categories:

Categories = MetricCategory.RAG | MetricCategory.RequiresContext | MetricCategory.LLMBased

This is more flexible than a single category assignment.
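To make the combination and querying concrete, here is a small sketch of testing a combined value with bitwise operations. The enum is copied from the Decision section; FlagsDemo, HasAll, and HasAny are hypothetical helper names introduced for illustration.

```csharp
using System;

[Flags]
public enum MetricCategory
{
    None = 0,
    RequiresContext = 1 << 0,
    RequiresGroundTruth = 1 << 1,
    RequiresToolUsage = 1 << 2,
    RequiresEmbeddings = 1 << 3,
    RAG = 1 << 4,
    Agentic = 1 << 5,
    Conversation = 1 << 6,
    Safety = 1 << 7,
    Faithfulness = 1 << 8,
    Relevance = 1 << 9,
    Coherence = 1 << 10,
    Fluency = 1 << 11,
    LLMBased = 1 << 12,
    EmbeddingBased = 1 << 13,
    CodeBased = 1 << 14
}

public static class FlagsDemo
{
    // The combination used as the example in the rationale.
    public static readonly MetricCategory Example =
        MetricCategory.RAG | MetricCategory.RequiresContext | MetricCategory.LLMBased;

    // True when every bit in 'required' is set in 'value'.
    public static bool HasAll(MetricCategory value, MetricCategory required) =>
        (value & required) == required;

    // True when at least one bit in 'candidates' is set in 'value'.
    public static bool HasAny(MetricCategory value, MetricCategory candidates) =>
        (value & candidates) != 0;

    public static void Main()
    {
        Console.WriteLine(HasAll(Example, MetricCategory.RAG | MetricCategory.LLMBased));  // True
        Console.WriteLine(HasAll(Example, MetricCategory.RAG | MetricCategory.Safety));    // False
        Console.WriteLine(HasAny(Example, MetricCategory.RAG | MetricCategory.Safety));    // True
        Console.WriteLine(Example.HasFlag(MetricCategory.RequiresContext));                // True
    }
}
```

Enum.HasFlag behaves like HasAll here: it returns true only when every bit of its argument is set.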

Why Optional Interfaces?

Marker interfaces like IQualityMetric and ISafetyMetric:

Enable compile-time type safety for metric filtering
Support DI registration patterns (services.AddSingleton<ISafetyMetric, ToxicityMetric>())
Allow grouping without breaking existing code

Why Not Break Existing Interfaces?

The new categories and metadata are purely additive and come with default values, so existing metric implementations continue to compile and work unchanged.

Consequences

Positive

Better discoverability: Users can query metrics by category
Future-proof: Safety and multimodal metrics fit naturally
Tooling support: CLI can list metrics by category
Cost awareness: Estimated costs are queryable

Negative

More code: Additional interfaces and enum to maintain
Optional complexity: Developers must decide which interfaces to implement
Migration effort: Existing metrics should be updated with categories (non-breaking)

Migration Path

Add new interfaces and enum (Phase 1)
Update existing metrics with Categories property (Phase 2)
Add IMetricRegistry service (Phase 3)
Update CLI to use categories (Phase 4)

Alternatives Considered

1. Attribute-Based Categorization

[MetricCategory(Category.RAG, Category.LLMBased)]
public class FaithfulnessMetric : IRAGMetric { }

Rejected: It requires reflection, is harder to query at runtime, and doesn't support DI patterns.
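For comparison, the sketch below shows roughly what reading categories would involve under the attribute-based approach. The attribute shape is inferred from the usage line above and is an assumption; the Category enum values are limited to the two shown there, the IRAGMetric base is omitted to keep the example self-contained, and AttributeLookup is a hypothetical helper.

```csharp
using System;
using System.Reflection;

public enum Category { RAG, LLMBased }

[AttributeUsage(AttributeTargets.Class)]
public sealed class MetricCategoryAttribute : Attribute
{
    public Category[] Categories { get; }

    public MetricCategoryAttribute(params Category[] categories) => Categories = categories;
}

[MetricCategory(Category.RAG, Category.LLMBased)]
public class FaithfulnessMetric { }

public class UntaggedMetric { }

public static class AttributeLookup
{
    // Categories live on the type, not the instance, so every query goes through reflection.
    public static Category[] CategoriesOf(Type metricType)
    {
        var attribute = metricType.GetCustomAttribute<MetricCategoryAttribute>();
        return attribute?.Categories ?? Array.Empty<Category>();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CategoriesOf(typeof(FaithfulnessMetric))));  // RAG, LLMBased
        Console.WriteLine(CategoriesOf(typeof(UntaggedMetric)).Length);                  // 0
    }
}
```

With the chosen design the same information is a plain property read on an IMetric instance, which is what lets a registry filter metrics without inspecting types.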

2. Separate Category Hierarchy

IMetric
├── IQualityMetric
│   ├── IRAGMetric
│   └── IAgenticMetric
└── ISafetyMetric

Rejected: Too rigid. Some metrics span categories (e.g., a safety metric that's also RAG).

3. No Formal Taxonomy

Keep flat structure, rely on documentation.

Rejected: Doesn't scale as metric count grows. Poor tooling support.

Authors: AI-assisted planning, January 2026
