RAG Metrics Guide

Comprehensive evaluation for Retrieval-Augmented Generation systems

Overview

AgentEval provides 13 metrics specifically designed for evaluating RAG pipelines, covering the entire retrieval to generation to evaluation cycle.

Metric Categories

Category	Metrics	Cost	Best For
LLM-based	Faithfulness, Relevance, Context Precision/Recall, Answer Correctness	$$$	High-accuracy evaluation, production sampling
Quality	Groundedness, Coherence, Fluency	$$$	Safety validation, language quality
Embedding-based	Answer Similarity, Response-Context, Query-Context	$	Volume testing, fast feedback
Information Retrieval	Recall@K, MRR	FREE	CI/CD, retrieval optimization

When to Use Each

RAG Pipeline Stage          Recommended Metrics
---------------------------------------------------------
                           +-----------------------------+
  Query                    |                             |
    |                      |                             |
    v                      |                             |
+-----------+              |  code_recall_at_k (FREE)    |
| Retrieval |--------------|  code_mrr (FREE)            |
+-----------+              |  embed_query_context ($)    |
    |                      |  llm_context_precision ($$$)|
    |                      |  llm_context_recall ($$$)   |
    v                      |                             |
+-----------+              |                             |
| Generation|--------------|  llm_faithfulness ($$$)     |
+-----------+              |  embed_response_context ($) |
    |                      |                             |
    v                      |                             |
+-----------+              |  llm_answer_correctness ($$$|
|  Answer   |--------------|  embed_answer_similarity ($)|
+-----------+              |  llm_relevance ($$$)        |
                           +-----------------------------+

Information Retrieval Metrics (FREE)

These metrics evaluate retrieval quality using only document IDs - no API calls required.

code_recall_at_k

Purpose: Measures what proportion of relevant documents were found in the top K results.

Formula: Recall@K = |Relevant INTERSECT Retrieved@K| / |Relevant|

Property	Value
Interface	`IRAGMetric`
Cost	FREE
Default K	10
Pass Threshold	70%

Example:

var metric = new RecallAtKMetric(k: 5);

var context = new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2", "doc3"],      // Ground truth
    RetrievedDocumentIds = ["doc1", "doc4", "doc2", "doc5", "doc6"]  // Search results
};

var result = await metric.EvaluateAsync(context);
// Score: 67 (2 of 3 relevant docs found in top 5)

AdditionalData:

k - The K value used
relevant_count - Total relevant documents
retrieved_at_k - Documents retrieved at K
relevant_found_at_k - Relevant documents found in results

code_mrr

Purpose: Mean Reciprocal Rank - measures how early the first relevant document appears.

Formula: MRR = 1 / rank_of_first_relevant

Property	Value
Interface	`IRAGMetric`
Cost	FREE
Default maxRank	Unlimited
Pass Threshold	33% (first relevant in top 3)

Scoring:

First Relevant Rank	Score
1	100
2	50
3	33
10	10
Not found	0

Example:

var metric = new MRRMetric(maxRank: 10);

var context = new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2"],
    RetrievedDocumentIds = ["docA", "docB", "doc1", "docC"]  // doc1 at rank 3
};

var result = await metric.EvaluateAsync(context);
// Score: 33 (1/3 = 0.333)

AdditionalData:

first_relevant_rank - Position of first relevant doc (0 if none)
max_rank - Maximum rank considered
reciprocal_rank - The 1/rank value

Embedding Metrics ($)

Fast semantic similarity using vector embeddings - 10-100x faster than LLM evaluation.

embed_answer_similarity

Purpose: Measures semantic similarity between response and ground truth.

Property	Value
Interface	`IRAGMetric`
Requires	Ground Truth
Cost	~$0.0001/eval
Default Threshold	70

When to Use:

You have ground truth answers
Speed/cost are priorities
Semantic meaning matters more than exact wording

Example:

var embeddings = new OpenAIEmbeddings("text-embedding-3-small");
var metric = new AnswerSimilarityMetric(embeddings, passingThreshold: 80);

var context = new EvaluationContext
{
    Output = "Paris is the capital of France.",
    GroundTruth = "The capital of France is Paris."
};

var result = await metric.EvaluateAsync(context);
// Score: ~92 (high semantic similarity)

Limitations:

Cannot detect subtle factual errors (wrong numbers)
May miss negation differences ("is" vs "is not")

embed_response_context

Purpose: Measures how grounded the response is in the retrieved context.

Property	Value
Interface	`IRAGMetric`
Requires	Context
Cost	~$0.0001/eval
Default Threshold	50 (lower - responses may extend context)

When to Use:

Quick grounding check
Detecting hallucination beyond context
Pre-filter before expensive LLM evaluation

Example:

var metric = new ResponseContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Context = "Contoso sells widgets and gadgets in three product lines.",
    Output = "Contoso offers widgets, gadgets, and enterprise solutions."
};

var result = await metric.EvaluateAsync(context);
// High score = response grounded in context

embed_query_context

Purpose: Measures relevance of retrieved context to the user's query.

Property	Value
Interface	`IRAGMetric`
Requires	Input, Context
Cost	~$0.0001/eval
Default Threshold	70

When to Use:

Evaluating retriever quality
Debugging poor RAG responses
Retrieval system benchmarks

Example:

var metric = new QueryContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Input = "How do I reset my password?",
    Context = "To reset your password, navigate to Settings > Security..."
};

var result = await metric.EvaluateAsync(context);
// High score = context is relevant to query

LLM-Based Metrics ($$$)

Highest accuracy evaluation using LLM-as-judge patterns.

llm_faithfulness

Purpose: Measures whether the response is grounded in context (no hallucinations).

Property	Value
Interface	`IRAGMetric`
Requires	Context
Cost	~$0.01-0.05/eval

When to Use:

Validating RAG systems don't hallucinate
Production quality sampling
High-stakes applications

Example:

var metric = new FaithfulnessMetric(chatClient);

var context = new EvaluationContext
{
    Input = "What is the capital of France?",
    Output = "Paris is the capital of France and has a population of 2.1 million.",
    Context = "France is a country in Europe. Its capital is Paris."
};

var result = await metric.EvaluateAsync(context);
// Score depends on whether "2.1 million" is in context

llm_relevance

Purpose: Measures how relevant the response is to the user's question.

Property	Value
Interface	`IRAGMetric`
Requires	Input only
Cost	~$0.01-0.05/eval

When to Use:

General quality checks
Detecting off-topic responses
Any LLM output evaluation

llm_context_precision

Purpose: Measures how much of the retrieved context is actually useful.

Property	Value
Interface	`IRAGMetric`
Requires	Context
Cost	~$0.01-0.05/eval

When to Use:

Optimizing retrieval quality
Reducing context noise
Improving retrieval efficiency

llm_context_recall

Purpose: Measures whether context contains all information needed for the correct answer.

Property	Value
Interface	`IRAGMetric`
Requires	Context, Ground Truth
Cost	~$0.01-0.05/eval

When to Use:

Evaluating retrieval completeness
Ensuring no critical information missed
Testing retrieval coverage

llm_answer_correctness

Purpose: Measures factual correctness compared to ground truth.

Property	Value
Interface	`IRAGMetric`
Requires	Ground Truth
Cost	~$0.01-0.05/eval

When to Use:

QA system evaluation
Fact-checking responses
Regression testing with known answers

Cost Optimization Strategy

Tiered Evaluation Approach

// TIER 1: CI/CD - FREE metrics only
var ciMetrics = new IMetric[]
{
    new RecallAtKMetric(k: 10),
    new MRRMetric()
};

// TIER 2: Development - Add embedding metrics
var devMetrics = ciMetrics.Concat(new IMetric[]
{
    new AnswerSimilarityMetric(embeddings),
    new ResponseContextSimilarityMetric(embeddings)
});

// TIER 3: Production Sampling - Full LLM evaluation
var prodMetrics = devMetrics.Concat(new IMetric[]
{
    new FaithfulnessMetric(chatClient),
    new AnswerCorrectnessMetric(chatClient)
});

Fast-Then-Deep Pattern

// Quick embedding check first
var embedScore = await embedMetric.EvaluateAsync(context);

if (embedScore.Score < 60)
{
    // Very low - skip expensive LLM eval
    return TestResult.Fail("Response not grounded in context");
}

if (embedScore.Score < 85)
{
    // Borderline - escalate to LLM
    var llmScore = await faithfulnessMetric.EvaluateAsync(context);
    return llmScore;
}

return embedScore;  // High confidence pass

Sampling for Production

var sampleRate = 0.10;  // 10% of traffic

if (Random.Shared.NextDouble() < sampleRate)
{
    // Run expensive metrics on sample only
    await faithfulnessMetric.EvaluateAsync(context);
    await answerCorrectnessMetric.EvaluateAsync(context);
}

Complete RAG Evaluation Example

using AgentEval.Core;
using AgentEval.Metrics.RAG;
using AgentEval.Embeddings;

// Setup
var embeddings = new OpenAIEmbeddings("text-embedding-3-small");
var chatClient = GetAzureOpenAIChatClient();

// Define all RAG metrics
var metrics = new IMetric[]
{
    // FREE - Information Retrieval
    new RecallAtKMetric(k: 5),
    new MRRMetric(),
    
    // $ - Embedding-based
    new AnswerSimilarityMetric(embeddings),
    new ResponseContextSimilarityMetric(embeddings),
    new QueryContextSimilarityMetric(embeddings),
    
    // $$$ - LLM-based
    new FaithfulnessMetric(chatClient),
    new RelevanceMetric(chatClient),
    new ContextPrecisionMetric(chatClient),
    new ContextRecallMetric(chatClient),
    new AnswerCorrectnessMetric(chatClient)
};

// Prepare evaluation context
var context = new EvaluationContext
{
    Input = "What is AgentEval?",
    Output = "AgentEval is a .NET evaluation toolkit for AI agents.",
    Context = "AgentEval is the comprehensive .NET evaluation toolkit...",
    GroundTruth = "AgentEval is a .NET toolkit for evaluating AI agents.",
    RelevantDocumentIds = ["doc-agenteval-readme", "doc-agenteval-quickstart"],
    RetrievedDocumentIds = ["doc-agenteval-readme", "doc-other", "doc-quickstart"]
};

// Run all metrics
Console.WriteLine("Metric                    Score  Passed");
Console.WriteLine("-----------------------------------------");

foreach (var metric in metrics)
{
    var result = await metric.EvaluateAsync(context);
    var status = result.Passed ? "PASS" : "FAIL";
    Console.WriteLine($"{metric.Name,-25} {result.Score,5:F0}  {status}");
}

Sample Output:

Metric                    Score  Passed
-----------------------------------------
code_recall_at_k            100  PASS
code_mrr                    100  PASS
embed_answer_similarity      94  PASS
embed_response_context       89  PASS
embed_query_context          85  PASS
llm_faithfulness             95  PASS
llm_relevance                90  PASS
llm_context_precision        88  PASS
llm_context_recall           92  PASS
llm_answer_correctness       96  PASS

Data Requirements

Metric	Input	Output	Context	Ground Truth	Doc IDs
`code_recall_at_k`	-	-	-	-	Yes
`code_mrr`	-	-	-	-	Yes
`embed_answer_similarity`	-	Yes	-	Yes	-
`embed_response_context`	-	Yes	Yes	-	-
`embed_query_context`	Yes	-	Yes	-	-
`llm_faithfulness`	Yes	Yes	Yes	-	-
`llm_relevance`	Yes	Yes	-	-	-
`llm_context_precision`	Yes	-	Yes	-	-
`llm_context_recall`	Yes	-	Yes	Yes	-
`llm_answer_correctness`	Yes	Yes	-	Yes	-
`llm_groundedness`	Yes	Yes	Yes	-	-
`llm_coherence`	-	Yes	-	-	-
`llm_fluency`	-	Yes	-	-	-

Quality Metrics ($$$)

LLM-based metrics for evaluating response safety, coherence, and language quality.

llm_groundedness

Purpose: Detects unsubstantiated claims, fabricated sources, or speculation presented as fact.

Property	Value
Interface	`IRAGMetric`, `ISafetyMetric`
Requires	Context
Cost	~$0.01-0.05/eval

When to Use:

Safety validation for AI responses
Detecting fabricated citations or statistics
High-stakes applications requiring factual accuracy
Catching "confident hallucinations"

Example:

var metric = new GroundednessMetric(chatClient);

var context = new EvaluationContext
{
    Input = "What were our Q3 sales?",
    Output = "According to the Q3 report, sales increased 15% to $2.3M.",
    Context = "Q3 sales grew 15% year-over-year."  // Note: $2.3M not mentioned
};

var result = await metric.EvaluateAsync(context);
// Lower score - specific number ($2.3M) not grounded in context

llm_coherence

Purpose: Measures logical coherence and internal consistency - no self-contradictions.

Property	Value
Interface	`IRAGMetric`, `IQualityMetric`
Requires	Output only
Cost	~$0.01-0.05/eval

When to Use:

Detecting self-contradictions in complex responses
Evaluating logical flow of multi-step explanations
Quality assurance for long-form content
Ensuring ideas connect naturally

Example:

var metric = new CoherenceMetric(chatClient);

var context = new EvaluationContext
{
    Input = "Explain the deployment process",
    Output = "First, build the project. Then test locally. The first step is testing..."
};

var result = await metric.EvaluateAsync(context);
// Low score - contradictory: "first" used for both build and testing

llm_fluency

Purpose: Evaluates grammar, readability, and natural language quality.

Property	Value
Interface	`IRAGMetric`, `IQualityMetric`
Requires	Output only
Cost	~$0.01-0.05/eval

When to Use:

Customer-facing content validation
Grammar and style checking
Readability assessment
Non-native language output quality

Example:

var metric = new FluencyMetric(chatClient);

var context = new EvaluationContext
{
    Output = "The product is working very good and has many feature for user."
};

var result = await metric.EvaluateAsync(context);
// Moderate score - grammar issues: "very good" → "very well", "feature" → "features"

Table of Contents

RAG Metrics Guide

Overview

Metric Categories

When to Use Each

Information Retrieval Metrics (FREE)

code_recall_at_k

code_mrr

Embedding Metrics ($)

embed_answer_similarity

embed_response_context

embed_query_context

LLM-Based Metrics ($$$)

llm_faithfulness

llm_relevance

llm_context_precision

llm_context_recall

llm_answer_correctness

Cost Optimization Strategy

Tiered Evaluation Approach

Fast-Then-Deep Pattern

Sampling for Production

Complete RAG Evaluation Example

Data Requirements

Quality Metrics ($$$)

llm_groundedness

llm_coherence

llm_fluency

See Also