Table of Contents

RAG Metrics Guide

Comprehensive evaluation for Retrieval-Augmented Generation systems


Overview

AgentEval provides 13 metrics specifically designed for evaluating RAG pipelines, covering the entire retrieval to generation to evaluation cycle.

Metric Categories

Category Metrics Cost Best For
LLM-based Faithfulness, Relevance, Context Precision/Recall, Answer Correctness $$$ High-accuracy evaluation, production sampling
Quality Groundedness, Coherence, Fluency $$$ Safety validation, language quality
Embedding-based Answer Similarity, Response-Context, Query-Context $ Volume testing, fast feedback
Information Retrieval Recall@K, MRR FREE CI/CD, retrieval optimization

When to Use Each

RAG Pipeline Stage          Recommended Metrics
---------------------------------------------------------
                           +-----------------------------+
  Query                    |                             |
    |                      |                             |
    v                      |                             |
+-----------+              |  code_recall_at_k (FREE)    |
| Retrieval |--------------|  code_mrr (FREE)            |
+-----------+              |  embed_query_context ($)    |
    |                      |  llm_context_precision ($$$)|
    |                      |  llm_context_recall ($$$)   |
    v                      |                             |
+-----------+              |                             |
| Generation|--------------|  llm_faithfulness ($$$)     |
+-----------+              |  embed_response_context ($) |
    |                      |                             |
    v                      |                             |
+-----------+              |  llm_answer_correctness ($$$|
|  Answer   |--------------|  embed_answer_similarity ($)|
+-----------+              |  llm_relevance ($$$)        |
                           +-----------------------------+

Information Retrieval Metrics (FREE)

These metrics evaluate retrieval quality using only document IDs - no API calls required.

code_recall_at_k

Purpose: Measures what proportion of relevant documents were found in the top K results.

Formula: Recall@K = |Relevant INTERSECT Retrieved@K| / |Relevant|

Property Value
Interface IRAGMetric
Cost FREE
Default K 10
Pass Threshold 70%

Example:

var metric = new RecallAtKMetric(k: 5);

var context = new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2", "doc3"],      // Ground truth
    RetrievedDocumentIds = ["doc1", "doc4", "doc2", "doc5", "doc6"]  // Search results
};

var result = await metric.EvaluateAsync(context);
// Score: 67 (2 of 3 relevant docs found in top 5)

AdditionalData:

  • k - The K value used
  • relevant_count - Total relevant documents
  • retrieved_at_k - Documents retrieved at K
  • relevant_found_at_k - Relevant documents found in results

code_mrr

Purpose: Mean Reciprocal Rank - measures how early the first relevant document appears.

Formula: MRR = 1 / rank_of_first_relevant

Property Value
Interface IRAGMetric
Cost FREE
Default maxRank Unlimited
Pass Threshold 33% (first relevant in top 3)

Scoring:

First Relevant Rank Score
1 100
2 50
3 33
10 10
Not found 0

Example:

var metric = new MRRMetric(maxRank: 10);

var context = new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2"],
    RetrievedDocumentIds = ["docA", "docB", "doc1", "docC"]  // doc1 at rank 3
};

var result = await metric.EvaluateAsync(context);
// Score: 33 (1/3 = 0.333)

AdditionalData:

  • first_relevant_rank - Position of first relevant doc (0 if none)
  • max_rank - Maximum rank considered
  • reciprocal_rank - The 1/rank value

Embedding Metrics ($)

Fast semantic similarity using vector embeddings - 10-100x faster than LLM evaluation.

embed_answer_similarity

Purpose: Measures semantic similarity between response and ground truth.

Property Value
Interface IRAGMetric
Requires Ground Truth
Cost ~$0.0001/eval
Default Threshold 70

When to Use:

  • You have ground truth answers
  • Speed/cost are priorities
  • Semantic meaning matters more than exact wording

Example:

var embeddings = new OpenAIEmbeddings("text-embedding-3-small");
var metric = new AnswerSimilarityMetric(embeddings, passingThreshold: 80);

var context = new EvaluationContext
{
    Output = "Paris is the capital of France.",
    GroundTruth = "The capital of France is Paris."
};

var result = await metric.EvaluateAsync(context);
// Score: ~92 (high semantic similarity)

Limitations:

  • Cannot detect subtle factual errors (wrong numbers)
  • May miss negation differences ("is" vs "is not")

embed_response_context

Purpose: Measures how grounded the response is in the retrieved context.

Property Value
Interface IRAGMetric
Requires Context
Cost ~$0.0001/eval
Default Threshold 50 (lower - responses may extend context)

When to Use:

  • Quick grounding check
  • Detecting hallucination beyond context
  • Pre-filter before expensive LLM evaluation

Example:

var metric = new ResponseContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Context = "Contoso sells widgets and gadgets in three product lines.",
    Output = "Contoso offers widgets, gadgets, and enterprise solutions."
};

var result = await metric.EvaluateAsync(context);
// High score = response grounded in context

embed_query_context

Purpose: Measures relevance of retrieved context to the user's query.

Property Value
Interface IRAGMetric
Requires Input, Context
Cost ~$0.0001/eval
Default Threshold 70

When to Use:

  • Evaluating retriever quality
  • Debugging poor RAG responses
  • Retrieval system benchmarks

Example:

var metric = new QueryContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Input = "How do I reset my password?",
    Context = "To reset your password, navigate to Settings > Security..."
};

var result = await metric.EvaluateAsync(context);
// High score = context is relevant to query

LLM-Based Metrics ($$$)

Highest accuracy evaluation using LLM-as-judge patterns.

llm_faithfulness

Purpose: Measures whether the response is grounded in context (no hallucinations).

Property Value
Interface IRAGMetric
Requires Context
Cost ~$0.01-0.05/eval

When to Use:

  • Validating RAG systems don't hallucinate
  • Production quality sampling
  • High-stakes applications

Example:

var metric = new FaithfulnessMetric(chatClient);

var context = new EvaluationContext
{
    Input = "What is the capital of France?",
    Output = "Paris is the capital of France and has a population of 2.1 million.",
    Context = "France is a country in Europe. Its capital is Paris."
};

var result = await metric.EvaluateAsync(context);
// Score depends on whether "2.1 million" is in context

llm_relevance

Purpose: Measures how relevant the response is to the user's question.

Property Value
Interface IRAGMetric
Requires Input only
Cost ~$0.01-0.05/eval

When to Use:

  • General quality checks
  • Detecting off-topic responses
  • Any LLM output evaluation

llm_context_precision

Purpose: Measures how much of the retrieved context is actually useful.

Property Value
Interface IRAGMetric
Requires Context
Cost ~$0.01-0.05/eval

When to Use:

  • Optimizing retrieval quality
  • Reducing context noise
  • Improving retrieval efficiency

llm_context_recall

Purpose: Measures whether context contains all information needed for the correct answer.

Property Value
Interface IRAGMetric
Requires Context, Ground Truth
Cost ~$0.01-0.05/eval

When to Use:

  • Evaluating retrieval completeness
  • Ensuring no critical information missed
  • Testing retrieval coverage

llm_answer_correctness

Purpose: Measures factual correctness compared to ground truth.

Property Value
Interface IRAGMetric
Requires Ground Truth
Cost ~$0.01-0.05/eval

When to Use:

  • QA system evaluation
  • Fact-checking responses
  • Regression testing with known answers

Cost Optimization Strategy

Tiered Evaluation Approach

// TIER 1: CI/CD - FREE metrics only
var ciMetrics = new IMetric[]
{
    new RecallAtKMetric(k: 10),
    new MRRMetric()
};

// TIER 2: Development - Add embedding metrics
var devMetrics = ciMetrics.Concat(new IMetric[]
{
    new AnswerSimilarityMetric(embeddings),
    new ResponseContextSimilarityMetric(embeddings)
});

// TIER 3: Production Sampling - Full LLM evaluation
var prodMetrics = devMetrics.Concat(new IMetric[]
{
    new FaithfulnessMetric(chatClient),
    new AnswerCorrectnessMetric(chatClient)
});

Fast-Then-Deep Pattern

// Quick embedding check first
var embedScore = await embedMetric.EvaluateAsync(context);

if (embedScore.Score < 60)
{
    // Very low - skip expensive LLM eval
    return TestResult.Fail("Response not grounded in context");
}

if (embedScore.Score < 85)
{
    // Borderline - escalate to LLM
    var llmScore = await faithfulnessMetric.EvaluateAsync(context);
    return llmScore;
}

return embedScore;  // High confidence pass

Sampling for Production

var sampleRate = 0.10;  // 10% of traffic

if (Random.Shared.NextDouble() < sampleRate)
{
    // Run expensive metrics on sample only
    await faithfulnessMetric.EvaluateAsync(context);
    await answerCorrectnessMetric.EvaluateAsync(context);
}

Complete RAG Evaluation Example

using AgentEval.Core;
using AgentEval.Metrics.RAG;
using AgentEval.Embeddings;

// Setup
var embeddings = new OpenAIEmbeddings("text-embedding-3-small");
var chatClient = GetAzureOpenAIChatClient();

// Define all RAG metrics
var metrics = new IMetric[]
{
    // FREE - Information Retrieval
    new RecallAtKMetric(k: 5),
    new MRRMetric(),
    
    // $ - Embedding-based
    new AnswerSimilarityMetric(embeddings),
    new ResponseContextSimilarityMetric(embeddings),
    new QueryContextSimilarityMetric(embeddings),
    
    // $$$ - LLM-based
    new FaithfulnessMetric(chatClient),
    new RelevanceMetric(chatClient),
    new ContextPrecisionMetric(chatClient),
    new ContextRecallMetric(chatClient),
    new AnswerCorrectnessMetric(chatClient)
};

// Prepare evaluation context
var context = new EvaluationContext
{
    Input = "What is AgentEval?",
    Output = "AgentEval is a .NET evaluation toolkit for AI agents.",
    Context = "AgentEval is the comprehensive .NET evaluation toolkit...",
    GroundTruth = "AgentEval is a .NET toolkit for evaluating AI agents.",
    RelevantDocumentIds = ["doc-agenteval-readme", "doc-agenteval-quickstart"],
    RetrievedDocumentIds = ["doc-agenteval-readme", "doc-other", "doc-quickstart"]
};

// Run all metrics
Console.WriteLine("Metric                    Score  Passed");
Console.WriteLine("-----------------------------------------");

foreach (var metric in metrics)
{
    var result = await metric.EvaluateAsync(context);
    var status = result.Passed ? "PASS" : "FAIL";
    Console.WriteLine($"{metric.Name,-25} {result.Score,5:F0}  {status}");
}

Sample Output:

Metric                    Score  Passed
-----------------------------------------
code_recall_at_k            100  PASS
code_mrr                    100  PASS
embed_answer_similarity      94  PASS
embed_response_context       89  PASS
embed_query_context          85  PASS
llm_faithfulness             95  PASS
llm_relevance                90  PASS
llm_context_precision        88  PASS
llm_context_recall           92  PASS
llm_answer_correctness       96  PASS

Data Requirements

Metric Input Output Context Ground Truth Doc IDs
code_recall_at_k - - - - Yes
code_mrr - - - - Yes
embed_answer_similarity - Yes - Yes -
embed_response_context - Yes Yes - -
embed_query_context Yes - Yes - -
llm_faithfulness Yes Yes Yes - -
llm_relevance Yes Yes - - -
llm_context_precision Yes - Yes - -
llm_context_recall Yes - Yes Yes -
llm_answer_correctness Yes Yes - Yes -
llm_groundedness Yes Yes Yes - -
llm_coherence - Yes - - -
llm_fluency - Yes - - -

Quality Metrics ($$$)

LLM-based metrics for evaluating response safety, coherence, and language quality.

llm_groundedness

Purpose: Detects unsubstantiated claims, fabricated sources, or speculation presented as fact.

Property Value
Interface IRAGMetric, ISafetyMetric
Requires Context
Cost ~$0.01-0.05/eval

When to Use:

  • Safety validation for AI responses
  • Detecting fabricated citations or statistics
  • High-stakes applications requiring factual accuracy
  • Catching "confident hallucinations"

Example:

var metric = new GroundednessMetric(chatClient);

var context = new EvaluationContext
{
    Input = "What were our Q3 sales?",
    Output = "According to the Q3 report, sales increased 15% to $2.3M.",
    Context = "Q3 sales grew 15% year-over-year."  // Note: $2.3M not mentioned
};

var result = await metric.EvaluateAsync(context);
// Lower score - specific number ($2.3M) not grounded in context

llm_coherence

Purpose: Measures logical coherence and internal consistency - no self-contradictions.

Property Value
Interface IRAGMetric, IQualityMetric
Requires Output only
Cost ~$0.01-0.05/eval

When to Use:

  • Detecting self-contradictions in complex responses
  • Evaluating logical flow of multi-step explanations
  • Quality assurance for long-form content
  • Ensuring ideas connect naturally

Example:

var metric = new CoherenceMetric(chatClient);

var context = new EvaluationContext
{
    Input = "Explain the deployment process",
    Output = "First, build the project. Then test locally. The first step is testing..."
};

var result = await metric.EvaluateAsync(context);
// Low score - contradictory: "first" used for both build and testing

llm_fluency

Purpose: Evaluates grammar, readability, and natural language quality.

Property Value
Interface IRAGMetric, IQualityMetric
Requires Output only
Cost ~$0.01-0.05/eval

When to Use:

  • Customer-facing content validation
  • Grammar and style checking
  • Readability assessment
  • Non-native language output quality

Example:

var metric = new FluencyMetric(chatClient);

var context = new EvaluationContext
{
    Output = "The product is working very good and has many feature for user."
};

var result = await metric.EvaluateAsync(context);
// Moderate score - grammar issues: "very good" → "very well", "feature" → "features"

See Also


Last updated: January 2026