Metrics Reference

Complete catalog of AgentEval evaluation metrics

Overview

AgentEval provides metrics in several categories for evaluating AI agents and RAG systems. Most metric names carry a prefix that indicates the computation method and cost; ConversationCompleteness is the exception.

Metric Prefixes

Prefix	Computation Method	Cost	Best For
`llm_`	LLM-as-judge evaluation	$$ (API calls)	Detailed quality analysis, production sampling
`code_`	Computed by code logic	Free	CI/CD, unit tests, high-volume testing
`embed_`	Embedding similarity	$ (embedding API)	Cost-effective semantic checks
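
The prefix can also drive metric selection, for example running only free metrics in CI. The following is a minimal sketch of that idea; `MetricCost` and `MetricPrefix` are hypothetical names introduced for illustration and are not part of AgentEval.

```csharp
using System;

// Hypothetical helper (not an AgentEval API): infers the cost tier of a
// metric from its name prefix, following the prefix table above.
public enum MetricCost { Free, Embedding, Llm, Unknown }

public static class MetricPrefix
{
    public static MetricCost CostOf(string metricName)
    {
        if (metricName.StartsWith("code_", StringComparison.Ordinal)) return MetricCost.Free;
        if (metricName.StartsWith("embed_", StringComparison.Ordinal)) return MetricCost.Embedding;
        if (metricName.StartsWith("llm_", StringComparison.Ordinal)) return MetricCost.Llm;
        return MetricCost.Unknown; // e.g. ConversationCompleteness has no prefix
    }
}
```

With this helper, `MetricPrefix.CostOf("code_mrr")` returns `MetricCost.Free`, so a CI job could filter a list of metric names down to those that make no API calls.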

RAG Metrics

Metrics for evaluating Retrieval-Augmented Generation systems.

llm_faithfulness

Purpose: Measures whether the response is grounded in the provided context (no hallucinations).

Property	Value
Interface	`IRAGMetric`
Requires Context	✅ Yes
Requires Ground Truth	❌ No
Cost	LLM API call

When to Use:

Validating that RAG systems don't hallucinate
Ensuring responses are supported by retrieved documents
Production quality checks

Example:

var metric = new FaithfulnessMetric(chatClient);
var result = await metric.EvaluateAsync(new EvaluationContext
{
    Input = "What is the capital of France?",
    Output = "Paris is the capital of France.",
    Context = "France is a country in Europe. Its capital is Paris."
});
// result.Score: 100 (fully faithful)

llm_relevance

Purpose: Measures how relevant the response is to the user's question.

Property	Value
Interface	`IRAGMetric`
Requires Context	❌ No
Requires Ground Truth	❌ No
Cost	LLM API call

When to Use:

Evaluating if responses address the question
General quality checks for any LLM output
Detecting off-topic responses

llm_context_precision

Purpose: Measures how much of the retrieved context is actually useful for answering.

Property	Value
Interface	`IRAGMetric`
Requires Context	✅ Yes
Requires Ground Truth	❌ No
Cost	LLM API call

When to Use:

Optimizing retrieval quality
Reducing context noise
Improving retrieval efficiency

llm_context_recall

Purpose: Measures whether the context contains all information needed for the correct answer.

Property	Value
Interface	`IRAGMetric`
Requires Context	✅ Yes
Requires Ground Truth	✅ Yes
Cost	LLM API call

When to Use:

Evaluating retrieval completeness
Ensuring no critical information is missed
Testing retrieval coverage

llm_answer_correctness

Purpose: Measures factual correctness compared to ground truth.

Property	Value
Interface	`IRAGMetric`
Requires Context	❌ No
Requires Ground Truth	✅ Yes
Cost	LLM API call

When to Use:

QA system evaluation
Fact-checking responses
Regression testing with known answers

Embedding Metrics

Fast, cost-effective semantic similarity metrics using embeddings.

embed_answer_similarity

Purpose: Measures semantic similarity between response and ground truth.

Property	Value
Interface	`IRAGMetric`
Requires Context	❌ No
Requires Ground Truth	✅ Yes
Cost	Embedding API call

When to Use:

Fast similarity checks at scale
When exact wording doesn't matter
Cost-effective alternative to llm_answer_correctness

Limitations:

Cannot detect subtle factual errors
May miss negation differences
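
Embedding metrics generally compare two vectors with a similarity function, most commonly cosine similarity. Whether AgentEval uses cosine similarity specifically is an assumption here; the sketch below only illustrates the general technique, and `EmbeddingMath` is a hypothetical name. Because the score depends on the overall direction of the vectors, two sentences that differ by a single negation can still score as highly similar.

```csharp
using System;

// Illustrative only: AgentEval's actual similarity function is not
// documented in this reference, so cosine similarity is an assumption.
public static class EmbeddingMath
{
    // Cosine similarity of two equal-length vectors, in the range [-1, 1].
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Treat a zero vector as having no similarity to anything.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.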

embed_response_context

Purpose: Measures how semantically grounded the response is in the context.

Property	Value
Interface	`IRAGMetric`
Requires Context	✅ Yes
Requires Ground Truth	❌ No
Cost	Embedding API call

When to Use:

Fast grounding check
Pre-filter before expensive LLM evaluation
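
One way to use an embedding metric as a pre-filter is to run it first and call the LLM metric only when the cheap score clears a threshold. The sketch below shows that control flow with plain delegates; `TieredEvaluation`, its parameters, and the idea of returning the cheap score when the LLM call is skipped are all assumptions for illustration, not AgentEval behavior. In practice the two delegates would wrap calls such as embed_response_context and llm_faithfulness.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not an AgentEval API): two-tier evaluation.
public static class TieredEvaluation
{
    // Runs the cheap scorer first and calls the expensive scorer only when
    // the cheap score reaches minCheapScore. Returns the score of the last
    // scorer that ran.
    public static async Task<double> ScoreAsync(
        Func<Task<double>> cheapScore,
        Func<Task<double>> expensiveScore,
        double minCheapScore)
    {
        double cheap = await cheapScore();
        if (cheap < minCheapScore)
            return cheap; // clearly ungrounded: skip the LLM call

        return await expensiveScore();
    }
}
```

The threshold trades cost against coverage: a higher value skips more LLM calls but relies more heavily on the embedding score alone.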

embed_query_context

Purpose: Measures relevance of retrieved context to the query.

Property	Value
Interface	`IRAGMetric`
Requires Context	✅ Yes
Requires Ground Truth	❌ No
Cost	Embedding API call

When to Use:

Evaluating retrieval relevance
Optimizing search/retrieval systems

Agentic Metrics

Metrics for evaluating tool-using AI agents.

code_tool_selection

Purpose: Measures whether the agent selected the correct tools.

Property	Value
Interface	`IAgenticMetric`
Requires Tool Usage	✅ Yes
Cost	Free (code-based)

When to Use:

CI/CD pipeline tests
Validating tool selection logic
High-volume testing

Example:

var metric = new ToolSelectionMetric(
    expectedTools: ["SearchTool", "CalculatorTool"],
    strictOrder: false);

var result = await metric.EvaluateAsync(context);
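
The `strictOrder` flag suggests two comparison modes: an order-insensitive check and an exact-sequence check. The sketch below shows one plausible reading of that distinction using only the standard library; `ToolSelectionSketch` is a hypothetical name, and the real metric may score partial matches rather than return a boolean.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical check (not AgentEval's implementation).
public static class ToolSelectionSketch
{
    public static bool Matches(
        IReadOnlyList<string> expected,
        IReadOnlyList<string> actual,
        bool strictOrder)
    {
        // Strict mode: same tools in the same order.
        if (strictOrder) return expected.SequenceEqual(actual);

        // Relaxed mode: same set of tools, in any order.
        return new HashSet<string>(expected).SetEquals(actual);
    }
}
```

Under this reading, an agent that calls CalculatorTool and then SearchTool matches the expectation above when `strictOrder` is false and fails when it is true.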

code_tool_arguments

Purpose: Validates tool arguments are complete and correct.

Property	Value
Interface	`IAgenticMetric`
Requires Tool Usage	✅ Yes
Cost	Free (code-based)

When to Use:

Ensuring required parameters are provided
Validating argument formats

code_tool_success

Purpose: Measures tool execution success rate.

Property	Value
Interface	`IAgenticMetric`
Requires Tool Usage	✅ Yes
Cost	Free (code-based)

When to Use:

Monitoring agent reliability
Detecting tool failures

code_tool_efficiency

Purpose: Measures efficiency of tool usage (call count, retries, duration).

Property	Value
Interface	`IAgenticMetric`
Requires Tool Usage	✅ Yes
Cost	Free (code-based)

When to Use:

Optimizing agent efficiency
Detecting excessive retries
Cost optimization

llm_task_completion

Purpose: Evaluates whether the agent completed the requested task.

Property	Value
Interface	`IAgenticMetric`
Requires Tool Usage	❌ No
Cost	LLM API call

When to Use:

End-to-end task validation
Complex multi-step task evaluation

Conversation Metrics

Metrics for multi-turn conversation evaluation.

ConversationCompleteness

Purpose: Evaluates multi-turn conversation quality, combining response rate, tool usage, duration compliance, and error-free execution.

Property	Value
Type	Special (non-IMetric)
Requires	ConversationResult
Cost	Free (code-based)

Sub-scores:

ResponseRate (40%) - Did every turn get a response?
ToolUsage (30%) - Were expected tools called?
DurationCompliance (15%) - Within time limits?
ErrorFree (15%) - No errors occurred?
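
The percentages sum to 100%, which suggests the overall score is a weighted average of the four sub-scores. That reading, and the 0 to 100 scale for each sub-score, are assumptions; the sketch below only works through the arithmetic, and `ConversationScore` is a hypothetical name.

```csharp
using System;

// Illustrative arithmetic only (not an AgentEval API).
public static class ConversationScore
{
    // Weighted average using the sub-score weights listed above.
    public static double Combine(
        double responseRate,
        double toolUsage,
        double durationCompliance,
        double errorFree)
    {
        return 0.40 * responseRate
             + 0.30 * toolUsage
             + 0.15 * durationCompliance
             + 0.15 * errorFree;
    }
}
```

For example, a conversation with a perfect response rate, half of the expected tool calls, and no duration or error problems gives `Combine(100, 50, 100, 100)`, which is 40 + 15 + 15 + 15 = 85.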

Information Retrieval Metrics

Code-based metrics for evaluating retrieval quality in RAG systems. They make no API calls, so they are free and fast.

code_recall_at_k

Purpose: Measures what proportion of relevant documents were found in the top K retrieved results.

Property	Value
Interface	`IRAGMetric`
Requires Context	❌ No
Requires Ground Truth	❌ No
Requires Document IDs	✅ Yes (`RelevantDocumentIds`, `RetrievedDocumentIds`)
Cost	Free (code-based)
Formula	`Recall@K = |Relevant ∩ Retrieved@K| / |Relevant|`

When to Use:

Evaluating retrieval coverage in RAG pipelines
CI/CD testing of vector search
Optimizing retrieval parameters (K value)
High-volume testing (free - no API costs)

Parameters:

k - Number of top results to consider (default: 10)
passThreshold - Minimum score to pass (default: 0.7 = 70%)

Example:

var metric = new RecallAtKMetric(k: 5);
var result = await metric.EvaluateAsync(new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2", "doc3"],
    RetrievedDocumentIds = ["doc1", "doc4", "doc2", "doc5", "doc6"]
});
// Score: 67 (2 of 3 relevant docs found in top 5)
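
The arithmetic behind that score can be reproduced without the library. The sketch below is a standalone implementation of the Recall@K formula; `RetrievalMath` is a hypothetical name, and returning 0 for an empty relevant set is an assumption rather than documented AgentEval behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Standalone illustration of the formula (not AgentEval's implementation).
public static class RetrievalMath
{
    // Recall@K = |Relevant ∩ Retrieved@K| / |Relevant|, as a fraction in [0, 1].
    public static double RecallAtK(
        IReadOnlyCollection<string> relevant,
        IReadOnlyList<string> retrieved,
        int k)
    {
        var relevantSet = new HashSet<string>(relevant);
        if (relevantSet.Count == 0) return 0.0; // assumption: nothing to recall

        // Count distinct relevant IDs among the first k retrieved results.
        int found = retrieved
            .Take(k)
            .Distinct()
            .Count(id => relevantSet.Contains(id));

        return (double)found / relevantSet.Count;
    }
}
```

With the document IDs from the example and k = 5, two of the three relevant documents appear in the top five, so the function returns 2/3, which rounds to 67 on a 0 to 100 scale.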

AdditionalData:

k - K value used
relevant_count - Total relevant documents
retrieved_at_k - Documents retrieved at K
relevant_found_at_k - Relevant documents found

code_mrr

Purpose: Mean Reciprocal Rank - measures how early the first relevant document appears in results.

Property	Value
Interface	`IRAGMetric`
Requires Context	❌ No
Requires Ground Truth	❌ No
Requires Document IDs	✅ Yes (`RelevantDocumentIds`, `RetrievedDocumentIds`)
Cost	Free (code-based)
Formula	`MRR = 1 / rank_of_first_relevant`

When to Use:

Evaluating retrieval ranking quality
User experience optimization (users want relevant docs first)
Search result quality assessment
CI/CD testing of ranking algorithms

Parameters:

maxRank - Maximum rank to consider (default: 0, meaning unlimited)
passThreshold - Minimum score to pass (default: 0.33 = first relevant in top 3)

Scoring:

First relevant at rank 1 → Score = 100 (1/1)
First relevant at rank 2 → Score = 50 (1/2)
First relevant at rank 3 → Score = 33 (1/3)
First relevant at rank 10 → Score = 10 (1/10)
No relevant found → Score = 0
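
The scoring table follows directly from the formula. The sketch below is a standalone reciprocal-rank calculation that also honors a maximum rank, with 0 meaning unlimited; `RankingMath` is a hypothetical name and this is not AgentEval's implementation.

```csharp
using System;
using System.Collections.Generic;

// Standalone illustration of the formula (not AgentEval's implementation).
public static class RankingMath
{
    // Returns 1 / rank of the first relevant document, or 0 if none is
    // found within maxRank (maxRank = 0 means no limit).
    public static double ReciprocalRank(
        IReadOnlyCollection<string> relevant,
        IReadOnlyList<string> retrieved,
        int maxRank = 0)
    {
        var relevantSet = new HashSet<string>(relevant);
        int limit = maxRank > 0
            ? Math.Min(maxRank, retrieved.Count)
            : retrieved.Count;

        for (int i = 0; i < limit; i++)
        {
            if (relevantSet.Contains(retrieved[i]))
                return 1.0 / (i + 1); // ranks are 1-based
        }

        return 0.0;
    }
}
```

A first relevant document at rank 3 yields 1/3, which rounds to 33 on a 0 to 100 scale; the same result list with `maxRank: 2` yields 0 because the relevant document falls outside the window.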

Example:

var metric = new MRRMetric(maxRank: 10);
var result = await metric.EvaluateAsync(new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2"],
    RetrievedDocumentIds = ["docA", "docB", "doc1", "docC"]  // doc1 at rank 3
});
// Score: 33 (1/3 = 0.333)

AdditionalData:

first_relevant_rank - Position of first relevant doc (0 if none found)
max_rank - Maximum rank considered (0 = unlimited)
reciprocal_rank - The 1/rank value

Quality Metrics

Metrics for evaluating response quality, safety, and language characteristics.

llm_groundedness

Purpose: Measures whether the response is grounded and avoids unsubstantiated claims, fabricated sources, or presenting speculation as fact.

Property	Value
Interface	`IRAGMetric`, `ISafetyMetric`
Requires Context	✅ Yes
Requires Ground Truth	❌ No
Cost	LLM API call

When to Use:

Safety validation for AI responses
Detecting fabricated sources, citations, or statistics
Ensuring claims are substantiated by context
High-stakes applications requiring factual accuracy

Example:

var metric = new GroundednessMetric(chatClient);
var result = await metric.EvaluateAsync(new EvaluationContext
{
    Input = "What are the sales numbers?",
    Output = "According to our Q3 report, sales increased 15%.",
    Context = "Q3 sales report shows 15% growth year-over-year."
});
// result.Score: High (claim supported by context)

llm_coherence

Purpose: Measures the logical coherence and internal consistency of the response.

Property	Value
Interface	`IRAGMetric`, `IQualityMetric`
Requires Context	❌ No
Requires Ground Truth	❌ No
Cost	LLM API call

When to Use:

Detecting self-contradictions in responses
Evaluating logical flow and structure
Quality assurance for complex, multi-part responses
Ensuring ideas connect naturally

Example:

var metric = new CoherenceMetric(chatClient);
var result = await metric.EvaluateAsync(new EvaluationContext
{
    Input = "Explain the process",
    Output = "First, prepare the ingredients. The final step is preparation..."
});
// result.Score: Low (contradictory: preparation both first and last)

llm_fluency

Purpose: Measures grammar, readability, and natural language quality of the response.

Property	Value
Interface	`IRAGMetric`, `IQualityMetric`
Requires Context	❌ No
Requires Ground Truth	❌ No
Cost	LLM API call

When to Use:

Grammar and style checking
Readability assessment
Language quality assurance
Customer-facing content validation

Example:

var metric = new FluencyMetric(chatClient);
var result = await metric.EvaluateAsync(new EvaluationContext
{
    Output = "The product is very good and works excellent for many purpose."
});
// result.Score: Moderate (minor grammar issues)

Workflow Metrics

Metrics for evaluating multi-agent workflows and complex execution pipelines.

code_workflow_structure_validity

Purpose: Validates workflow graph structure (nodes, edges, entry/exit points).

Property	Value
Interface	`IWorkflowMetric`
Requires Workflow Graph	✅ Yes
Cost	Free (code validation)

When to Use:

Ensuring workflow topology is correct
Validating graph connectivity
CI/CD workflow validation
Detecting broken workflow definitions

Example:

var metric = new WorkflowStructureValidityMetric();
var result = await metric.EvaluateAsync(new WorkflowEvaluationContext
{
    WorkflowGraph = workflowDefinition,
    ExpectedNodes = new[] { "Planner", "Writer", "Editor" }
});
// result.Score: 100 (valid structure)

code_workflow_execution_order

Purpose: Verifies agents executed in the expected order.

Property	Value
Interface	`IWorkflowMetric`
Requires Execution Result	✅ Yes
Cost	Free (sequence validation)

When to Use:

Sequential pipeline validation
Dependency checking
Ensuring proper workflow orchestration
Debugging execution flow

Example:

var metric = new WorkflowExecutionOrderMetric(
    expectedOrder: new[] { "Planner", "Researcher", "Writer", "Editor" }
);
var result = await metric.EvaluateAsync(context);
// result.Score: Based on sequence correctness

code_workflow_executor_success

Purpose: Measures the success rate of individual executors within the workflow.

Property	Value
Interface	`IWorkflowMetric`
Requires Executor Results	✅ Yes
Cost	Free (success rate calculation)

When to Use:

Per-agent reliability tracking
Identifying problematic workflow steps
Quality assurance for complex pipelines
Performance monitoring

Example:

var metric = new WorkflowExecutorSuccessMetric();
var result = await metric.EvaluateAsync(context);
// result.Score: Percentage of successful executor completions

llm_workflow_output_quality

Purpose: LLM-based evaluation of final workflow output quality.

Property	Value
Interface	`IWorkflowMetric`
Requires Workflow Output	✅ Yes
Cost	LLM API call

When to Use:

End-to-end workflow quality assessment
Complex pipeline output validation
Production quality gates
Multi-agent collaboration evaluation

Example:

var metric = new WorkflowOutputQualityMetric(
    chatClient,
    criteria: "Evaluate if the content creation workflow produced high-quality, coherent output"
);
var result = await metric.EvaluateAsync(context);
// result.Score: LLM assessment of final output quality

code_workflow_tool_chain_validity

Purpose: Validates tool usage patterns across multiple agents in the workflow.

Property	Value
Interface	`IWorkflowMetric`
Requires Tool Usage	✅ Yes
Cost	Free (tool pattern analysis)

When to Use:

Multi-agent tool coordination
Tool dependency validation
Complex tool chain verification
Cross-agent tool state management

Example:

var metric = new WorkflowToolChainValidityMetric(
    requiredPatterns: new[] {
        new ToolPattern("Search", beforeTool: "Book"),
        new ToolPattern("Confirm", beforeTool: "Execute")
    }
);
var result = await metric.EvaluateAsync(context);
// result.Score: Tool chain pattern compliance

code_workflow_performance_efficiency

Purpose: Measures overall workflow performance (timing, cost, resource usage).

Property	Value
Interface	`IWorkflowMetric`
Requires Performance Data	✅ Yes
Cost	Free (performance calculation)

When to Use:

Workflow optimization
Cost monitoring across multiple agents
SLA validation for complex pipelines
Performance regression detection

Example:

var metric = new WorkflowPerformanceEfficiencyMetric(
    maxDuration: TimeSpan.FromMinutes(5),
    maxTotalCost: 0.50m
);
var result = await metric.EvaluateAsync(context);
// result.Score: Based on performance vs. thresholds

Quick Reference Table

Metric	Category	Context	Ground Truth	Tool Usage	Cost
`llm_faithfulness`	RAG	✅	❌	❌	LLM
`llm_relevance`	RAG	❌	❌	❌	LLM
`llm_context_precision`	RAG	✅	❌	❌	LLM
`llm_context_recall`	RAG	✅	✅	❌	LLM
`llm_answer_correctness`	RAG	❌	✅	❌	LLM
`embed_answer_similarity`	Embedding	❌	✅	❌	Embed
`embed_response_context`	Embedding	✅	❌	❌	Embed
`embed_query_context`	Embedding	✅	❌	❌	Embed
`code_recall_at_k`	IR	❌	❌	❌	Free
`code_mrr`	IR	❌	❌	❌	Free
`code_tool_selection`	Agentic	❌	❌	✅	Free
`code_tool_arguments`	Agentic	❌	❌	✅	Free
`code_tool_success`	Agentic	❌	❌	✅	Free
`code_tool_efficiency`	Agentic	❌	❌	✅	Free
`llm_task_completion`	Agentic	❌	❌	❌	LLM
`llm_groundedness`	Quality	✅	❌	❌	LLM
`llm_coherence`	Quality	❌	❌	❌	LLM
`llm_fluency`	Quality	❌	❌	❌	LLM
`code_workflow_structure_validity`	Workflow	❌	❌	❌	Free
`code_workflow_execution_order`	Workflow	❌	❌	❌	Free
`code_workflow_executor_success`	Workflow	❌	❌	❌	Free
`llm_workflow_output_quality`	Workflow	❌	❌	❌	LLM
`code_workflow_tool_chain_validity`	Workflow	❌	❌	✅	Free
`code_workflow_performance_efficiency`	Workflow	❌	❌	❌	Free
