Implementing Embeddings

Technical guide for implementing IAgentEvalEmbeddings and using embedding utilities

Looking for guidance on when to use embedding metrics? See the RAG Metrics Guide.

Want to see embedding metrics in action? Check out Sample 12: Embedding Metrics for semantic similarity evaluation examples.

Overview

AgentEval includes embedding-based metrics that compute semantic similarity using vector embeddings. Because they need no LLM call, these metrics are typically much faster and cheaper than LLM-based evaluation (see Performance Considerations for typical figures), which makes them well suited to:

High-volume evaluations
Continuous integration pipelines
Quick feedback during development
Baseline comparison before more expensive LLM evaluation

Available Metrics

Metric	Compares	Use Case
`AnswerSimilarityMetric`	Response ↔ Ground Truth	Is the answer semantically correct?
`ResponseContextSimilarityMetric`	Response ↔ Context	Is the response grounded in context?
`QueryContextSimilarityMetric`	Query ↔ Context	Is the retrieved context relevant?

How It Works

All embedding metrics follow the same pattern:

Generate embeddings for two pieces of text
Compute cosine similarity between the vectors (mathematically in [-1, 1]; typically 0-1 for text embeddings)
Convert to 0-100 score using ScoreNormalizer.FromSimilarity()
Compare against threshold to determine pass/fail

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────┐
│   Text A    │───▶│  Embedding  │───▶│   Cosine    │───▶│  Score  │
│   Text B    │───▶│  Generator  │───▶│ Similarity  │───▶│  0-100  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────┘
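
The four steps above can be sketched in plain C# without the library. This is a minimal illustration, not AgentEval's implementation: `Cosine` and `ToScore` are hypothetical helpers, the three-element vectors stand in for real embeddings, and the clamp-then-scale conversion is only an assumption about what `ScoreNormalizer.FromSimilarity()` does.

```csharp
using System;

// Step 1 (stand-in): in practice these vectors come from an embedding provider.
float[] responseVec = { 0.9f, 0.1f, 0.4f };
float[] truthVec = { 0.8f, 0.2f, 0.5f };

// Step 2: cosine similarity = dot(a, b) / (|a| * |b|).
static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same dimension.");
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Step 3 (assumed behavior): clamp to [0, 1], then scale to 0-100.
static double ToScore(double value) => Math.Clamp(value, 0.0, 1.0) * 100.0;

double similarity = Cosine(responseVec, truthVec);
double score = ToScore(similarity);
bool passed = score >= 70; // Step 4: compare against the threshold.

Console.WriteLine($"similarity={similarity:F3} score={score:F1} passed={passed}");
```

With these sample vectors the similarity is about 0.98, so the result clears a threshold of 70.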

AnswerSimilarityMetric

Measures semantic similarity between the agent's response and the expected ground truth answer.

When to Use

You have ground truth answers for your test cases
Speed and cost are priorities
Semantic meaning matters more than exact wording

Limitations

Cannot detect subtle factual errors (e.g., wrong numbers)
May miss negation differences ("is" vs "is not")
Requires ground truth to be available
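
The negation limitation can be illustrated with a toy model. The sketch below does not use real embeddings: `WordCounts` is a hypothetical bag-of-words vectorizer, chosen only because it makes the arithmetic visible. Real embedding models encode far more than word overlap, but as noted above they can still place a sentence and its negation close together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical vectorizer: one dimension per vocabulary word, value = word count.
static double[] WordCounts(string text, IReadOnlyList<string> vocabulary)
{
    var words = text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    return vocabulary.Select(v => (double)words.Count(w => w == v)).ToArray();
}

static double Cosine(double[] a, double[] b)
{
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    double norms = Math.Sqrt(a.Sum(x => x * x)) * Math.Sqrt(b.Sum(y => y * y));
    return dot / norms;
}

var vocabulary = new[] { "the", "server", "is", "not", "running" };
var statement = WordCounts("The server is running", vocabulary);
var negation = WordCounts("The server is not running", vocabulary);

double similarity = Cosine(statement, negation);

// Four shared words, vector lengths 2 and sqrt(5): cosine = 4 / (2 * sqrt(5)), about 0.894.
Console.WriteLine($"similarity={similarity:F3}");
```

A score of roughly 89 would clear the default threshold of 70 even though the two sentences contradict each other, which is why an LLM-based check remains useful where factual correctness is critical.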

Example

using AgentEval.Embeddings;
using AgentEval.Metrics.RAG;
using AgentEval.Core;

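// Create an embedding generator. Any IAgentEvalEmbeddings implementation works;
// OpenAIEmbeddings is the sample class from "OpenAI Example", and `client` is the provider client it wraps.
var embeddings = new OpenAIEmbeddings(client, "text-embedding-3-small");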

// Create metric with default threshold (70)
var metric = new AnswerSimilarityMetric(embeddings);

// Or with custom threshold
var strictMetric = new AnswerSimilarityMetric(embeddings, passingThreshold: 85);

// Evaluate
var context = new EvaluationContext
{
    Input = "What is the capital of France?",
    Output = "Paris is the capital of France.",
    GroundTruth = "The capital of France is Paris."
};

var result = await metric.EvaluateAsync(context);

Console.WriteLine($"Score: {result.Score}");           // e.g., 92.5
Console.WriteLine($"Passed: {result.Passed}");         // true
Console.WriteLine($"Similarity: {result.Details["cosineSimilarity"]}");  // e.g., 0.925

📖 Complete Example: See Sample 12: Embedding Metrics for a complete implementation showing all three embedding metrics with real evaluation scenarios.

Result Details

Property	Description
`cosineSimilarity`	Raw cosine similarity (0-1)
`interpretation`	Human-readable interpretation ("Excellent", "Good", etc.)
`groundTruthLength`	Character length of ground truth
`outputLength`	Character length of response

ResponseContextSimilarityMetric

Measures how semantically similar the agent's response is to the retrieved context. This is a fast grounding check.

When to Use

Checking if RAG responses use the provided context
Quick grounding validation without LLM calls
Detecting when responses hallucinate beyond context

Default Threshold

This metric uses a lower default threshold (50) because a response may legitimately extend beyond the context with reasoning and formatting.

Example

var metric = new ResponseContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Input = "What products does Contoso sell?",
    Context = "Contoso Corporation is a leading manufacturer of widgets and gadgets. They offer three product lines: basic widgets, premium gadgets, and enterprise solutions.",
    Output = "Contoso sells widgets and gadgets across three product lines."
};

var result = await metric.EvaluateAsync(context);

if (result.Passed)
{
    Console.WriteLine("Response is grounded in context");
}
else
{
    Console.WriteLine($"Warning: Response may contain hallucinations. Similarity: {result.Details["cosineSimilarity"]:P1}");
}

QueryContextSimilarityMetric

Measures how relevant the retrieved context is to the user's query. Useful for evaluating retrieval quality without running the full RAG pipeline.

When to Use

Evaluating retriever quality
Debugging poor RAG responses (is the context even relevant?)
Retrieval system benchmarks

Example

var metric = new QueryContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Input = "How do I reset my password?",
    Context = "To reset your password, navigate to Settings > Security > Password Reset. Click 'Reset Password' and follow the instructions sent to your email.",
    Output = "" // Not needed for this metric
};

var result = await metric.EvaluateAsync(context);

Console.WriteLine($"Context relevance: {result.Score:F1}%");

EmbeddingSimilarity Utilities

The EmbeddingSimilarity static class provides low-level utilities for working with embeddings:

using AgentEval.Embeddings;

// Get embeddings from your provider
var embedding1 = await embeddings.GetEmbeddingAsync("Hello world");
var embedding2 = await embeddings.GetEmbeddingAsync("Hi there");

// Cosine similarity (most common)
float similarity = EmbeddingSimilarity.CosineSimilarity(
    embedding1.Span, 
    embedding2.Span);
// Range: [-1, 1], typically [0, 1] for text

// Euclidean distance
float distance = EmbeddingSimilarity.EuclideanDistance(
    embedding1.Span, 
    embedding2.Span);
// Range: [0, ∞), 0 = identical

// Dot product
float dotProduct = EmbeddingSimilarity.DotProduct(
    embedding1.Span, 
    embedding2.Span);
// Unnormalized similarity measure
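
To make the three measures concrete, the following self-contained sketch computes them by hand for two small vectors. The helper functions are hypothetical and written for readability; they are not the `EmbeddingSimilarity` implementation. The sketch also shows a property worth knowing: once vectors are normalized to unit length, the dot product equals the cosine similarity.

```csharp
using System;
using System.Linq;

static double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();
static double Norm(double[] a) => Math.Sqrt(Dot(a, a));
static double Cosine(double[] a, double[] b) => Dot(a, b) / (Norm(a) * Norm(b));
static double Euclidean(double[] a, double[] b) =>
    Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
static double[] Normalize(double[] a)
{
    var n = Norm(a);
    return a.Select(x => x / n).ToArray();
}

double[] v1 = { 3, 4 };
double[] v2 = { 4, 3 };

double dot = Dot(v1, v2);             // 3*4 + 4*3 = 24
double cosine = Cosine(v1, v2);       // 24 / (5 * 5) = 0.96
double distance = Euclidean(v1, v2);  // sqrt(1 + 1), about 1.414

// After normalizing to unit length, dot product and cosine similarity coincide.
double unitDot = Dot(Normalize(v1), Normalize(v2));

Console.WriteLine($"dot={dot} cosine={cosine:F2} distance={distance:F3} unitDot={unitDot:F2}");
```

Many embedding providers return unit-length vectors, in which case ranking by dot product and ranking by cosine similarity give the same order; check your provider's documentation before relying on that.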

Batch Operations

// Compare query against multiple candidates
var candidates = new List<ReadOnlyMemory<float>> 
{ 
    doc1Embedding, 
    doc2Embedding, 
    doc3Embedding 
};

float[] similarities = EmbeddingSimilarity.ComputeSimilarities(
    queryEmbedding, 
    candidates);

// Find top-k most similar items
var documents = new[]
{
    (doc1, doc1Embedding),
    (doc2, doc2Embedding),
    (doc3, doc3Embedding)
};

var topResults = EmbeddingSimilarity.TopK(queryEmbedding, documents, topK: 2);

foreach (var (doc, similarity) in topResults)
{
    Console.WriteLine($"{doc.Title}: {similarity:P1}");
}
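
Conceptually, a top-k search scores every candidate against the query and keeps the best k. The sketch below shows that idea with LINQ. `TopKByCosine` and the sample documents are hypothetical, and the library's `TopK` may differ in signature and tie-breaking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static double Cosine(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Score every candidate, sort by similarity (highest first), keep the first k.
static List<(string Title, double Similarity)> TopKByCosine(
    float[] query, IEnumerable<(string Title, float[] Embedding)> documents, int k) =>
    documents
        .Select(d => (d.Title, Similarity: Cosine(query, d.Embedding)))
        .OrderByDescending(r => r.Similarity)
        .Take(k)
        .ToList();

float[] queryEmbedding = { 1f, 0f };
var docs = new (string Title, float[] Embedding)[]
{
    ("Password reset", new[] { 0.9f, 0.1f }),
    ("Billing FAQ", new[] { 0.1f, 0.9f }),
    ("Account security", new[] { 0.7f, 0.3f }),
};

foreach (var (title, similarity) in TopKByCosine(queryEmbedding, docs, k: 2))
{
    Console.WriteLine($"{title}: {similarity:F3}");
}
```

Sorting the full list is O(n log n), which is fine for test suites; for very large candidate sets a vector index is usually a better fit than a linear scan.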

Implementing IAgentEvalEmbeddings

To use embedding metrics, implement the IAgentEvalEmbeddings interface:

public interface IAgentEvalEmbeddings
{
    /// <summary>
    /// Get embeddings for a single text.
    /// </summary>
    Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text, 
        CancellationToken cancellationToken = default);
    
    /// <summary>
    /// Get embeddings for multiple texts (batch).
    /// </summary>
    Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, 
        CancellationToken cancellationToken = default);
}
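
For unit tests it can help to have an implementation that needs no network access. The sketch below is one hypothetical option: `FakeEmbeddings` hashes each word into a fixed-size vector so that identical texts always produce identical embeddings. The interface is redeclared only so the snippet compiles on its own; in a real project you would reference the one from AgentEval. The vectors carry no semantic meaning, so use this to test plumbing, not metric quality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

IAgentEvalEmbeddings embeddings = new FakeEmbeddings(dimensions: 16);
var first = await embeddings.GetEmbeddingAsync("hello world");
var batch = await embeddings.GetEmbeddingsAsync(new[] { "hello world", "goodbye" });
Console.WriteLine($"dimensions={first.Length} batch={batch.Count}");

// Local copy of the interface shown above, so the snippet is self-contained.
public interface IAgentEvalEmbeddings
{
    Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text, CancellationToken cancellationToken = default);

    Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, CancellationToken cancellationToken = default);
}

// Hypothetical test double: deterministic, offline, and semantically meaningless.
public sealed class FakeEmbeddings : IAgentEvalEmbeddings
{
    private readonly int _dimensions;

    public FakeEmbeddings(int dimensions) => _dimensions = dimensions;

    public Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text, CancellationToken cancellationToken = default)
    {
        cancellationToken.ThrowIfCancellationRequested();
        var vector = new float[_dimensions];
        foreach (var word in text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            // Stable hash: string.GetHashCode is randomized per process, so it is avoided here.
            var bucket = word.Aggregate(17, (hash, c) => unchecked(hash * 31 + c)) & int.MaxValue;
            vector[bucket % _dimensions] += 1f;
        }
        return Task.FromResult<ReadOnlyMemory<float>>(vector);
    }

    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, CancellationToken cancellationToken = default)
    {
        var results = new List<ReadOnlyMemory<float>>();
        foreach (var text in texts)
        {
            results.Add(await GetEmbeddingAsync(text, cancellationToken));
        }
        return results;
    }
}
```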

DI Registration

IAgentEvalEmbeddings is registered automatically when you call services.AddAgentEval(), provided an IEmbeddingGenerator<string, Embedding<float>> is already registered: AddAgentEval() wraps that generator in MEAIEmbeddingAdapter. If you register your own IAgentEvalEmbeddings before calling AddAgentEval(), your registration is preserved (TryAdd semantics):

services.AddSingleton<IEmbeddingGenerator<string, Embedding<float>>>(generator);
services.AddAgentEval(); // IAgentEvalEmbeddings now available via DI

// Or register your own:
services.AddSingleton<IAgentEvalEmbeddings>(new MyCustomEmbeddings());
services.AddAgentEval(); // Your registration wins

OpenAI Example

using Microsoft.Extensions.AI;
using OpenAI;

// Assumes the Microsoft.Extensions.AI.OpenAI package. Method names follow its 9.x releases;
// older previews used different names.
public class OpenAIEmbeddings : IAgentEvalEmbeddings
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _generator;
    
    // Embeddings come from an embedding client, not from IChatClient.
    public OpenAIEmbeddings(OpenAIClient client, string model = "text-embedding-3-small")
    {
        _generator = client.GetEmbeddingClient(model).AsIEmbeddingGenerator();
    }
    
    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text, 
        CancellationToken ct = default)
    {
        // GenerateVectorAsync returns the raw vector for a single input.
        return await _generator.GenerateVectorAsync(text, cancellationToken: ct);
    }
    
    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, 
        CancellationToken ct = default)
    {
        // One request for the whole batch; results come back in input order.
        var results = await _generator.GenerateAsync(texts.ToList(), cancellationToken: ct);
        return results.Select(r => r.Vector).ToList();
    }
}

Azure AI Example

using Azure;
using Azure.AI.Inference;

// Assumes the Azure.AI.Inference package.
public class AzureEmbeddings : IAgentEvalEmbeddings
{
    private readonly EmbeddingsClient _client;
    private readonly string _deploymentName;
    
    public AzureEmbeddings(string endpoint, string key, string deploymentName)
    {
        // EmbeddingsClient takes only the endpoint and credential; the model is set per request.
        _client = new EmbeddingsClient(new Uri(endpoint), new AzureKeyCredential(key));
        _deploymentName = deploymentName;
    }
    
    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text, 
        CancellationToken ct = default)
    {
        var results = await GetEmbeddingsAsync(new[] { text }, ct);
        return results[0];
    }
    
    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, 
        CancellationToken ct = default)
    {
        var options = new EmbeddingsOptions(texts.ToList()) { Model = _deploymentName };
        var response = await _client.EmbedAsync(options, ct);
        // Each EmbeddingItem.Embedding is BinaryData holding a JSON array of floats.
        return response.Value.Data
            .Select(d => (ReadOnlyMemory<float>)d.Embedding.ToObjectFromJson<float[]>())
            .ToList();
    }
}

Score Interpretation

Embedding similarity scores are normalized to a 0-100 scale:

Score Range	Interpretation	Typical Meaning
90-100	Excellent	Nearly identical meaning
80-89	Good	Very similar meaning
70-79	Acceptable	Related meaning
60-69	Marginal	Somewhat related
50-59	Poor	Weakly related
0-49	Fail	Different topics

The ScoreNormalizer class handles this conversion:

// Convert cosine similarity (0-1) to score (0-100)
double score = ScoreNormalizer.FromSimilarity(0.85);  // Returns 85.0

// Get human-readable interpretation
string interpretation = ScoreNormalizer.Interpret(85);  // Returns "Good"
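
The banding in the table can be expressed as a simple lookup. This is a sketch of the table's thresholds only: `InterpretScore` is a hypothetical helper, and the real `ScoreNormalizer.Interpret` may handle boundaries or out-of-range values differently.

```csharp
using System;

// Maps a 0-100 score to the label from the interpretation table.
static string InterpretScore(double value) => value switch
{
    >= 90 => "Excellent",
    >= 80 => "Good",
    >= 70 => "Acceptable",
    >= 60 => "Marginal",
    >= 50 => "Poor",
    _ => "Fail",
};

foreach (var sample in new[] { 92.5, 85.0, 70.0, 49.9 })
{
    Console.WriteLine($"{sample:F1} -> {InterpretScore(sample)}");
}
```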

Performance Considerations

Speed

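Going by the typical latencies below, embedding metrics are roughly 5-100x faster than LLM-based metrics, and they cost a small fraction as much per evaluation: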

Metric Type	Typical Latency	Cost per Eval
LLM-based (GPT-4)	1-5 seconds	$0.01-0.05
Embedding-based	50-200ms	$0.0001

💡 Performance Tip: Sample 12 demonstrates batch processing and caching techniques for maximum performance.

Batching

For best performance, batch embedding requests:

// Bad: Sequential calls
foreach (var testCase in testCases)
{
    var embedding = await embeddings.GetEmbeddingAsync(testCase.Output);
}

// Good: Batch call
var outputs = testCases.Select(tc => tc.Output);
var allEmbeddings = await embeddings.GetEmbeddingsAsync(outputs);
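
Providers usually cap how many inputs a single request may carry, so a very large test suite may need to be split across several batch calls. The sketch below shows one way to do that with `Enumerable.Chunk` (.NET 6 or later). `EmbedInChunksAsync`, the chunk size of 100, and the stand-in `FakeBatch` provider are all illustrative assumptions; in practice the delegate would be your `IAgentEvalEmbeddings.GetEmbeddingsAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Splits the inputs into fixed-size chunks and issues one batch call per chunk,
// preserving input order in the combined result.
static async Task<List<ReadOnlyMemory<float>>> EmbedInChunksAsync(
    IEnumerable<string> texts,
    Func<IReadOnlyList<string>, Task<IReadOnlyList<ReadOnlyMemory<float>>>> embedBatch,
    int chunkSize)
{
    var all = new List<ReadOnlyMemory<float>>();
    foreach (string[] chunk in texts.Chunk(chunkSize))
    {
        all.AddRange(await embedBatch(chunk));
    }
    return all;
}

// Stand-in provider: counts calls and returns a one-element vector holding the text length.
var calls = 0;
Task<IReadOnlyList<ReadOnlyMemory<float>>> FakeBatch(IReadOnlyList<string> batch)
{
    calls++;
    IReadOnlyList<ReadOnlyMemory<float>> vectors =
        batch.Select(t => (ReadOnlyMemory<float>)new float[] { t.Length }).ToList();
    return Task.FromResult(vectors);
}

var outputs = Enumerable.Range(0, 250).Select(i => $"output {i}");
var embeddings = await EmbedInChunksAsync(outputs, FakeBatch, chunkSize: 100);

Console.WriteLine($"{embeddings.Count} embeddings in {calls} calls"); // 250 embeddings in 3 calls
```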

Caching

Consider caching embeddings for static content:

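using System.Collections.Concurrent;

public class CachingEmbeddings : IAgentEvalEmbeddings
{
    private readonly IAgentEvalEmbeddings _inner;
    private readonly ConcurrentDictionary<string, ReadOnlyMemory<float>> _cache = new(); // safe under parallel evaluation
    public CachingEmbeddings(IAgentEvalEmbeddings inner) => _inner = inner;

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(string text, CancellationToken ct = default)
    {
        if (_cache.TryGetValue(text, out var cached))
            return cached;
        var embedding = await _inner.GetEmbeddingAsync(text, ct);
        _cache[text] = embedding;
        return embedding;
    }

    // Batch path: embed only the uncached texts in one call, then answer from the cache.
    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(IEnumerable<string> texts, CancellationToken ct = default)
    {
        var all = texts.ToList();
        var missing = all.Where(t => !_cache.ContainsKey(t)).Distinct().ToList();
        if (missing.Count > 0)
        {
            var fresh = await _inner.GetEmbeddingsAsync(missing, ct);
            for (var i = 0; i < missing.Count; i++) _cache[missing[i]] = fresh[i];
        }
        return all.Select(t => _cache[t]).ToList();
    }
}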

Combining with LLM Metrics

Embedding metrics work well as a first-pass filter before expensive LLM evaluation:

// Quick embedding check first
var similarityMetric = new AnswerSimilarityMetric(embeddings, passingThreshold: 60);
var quickResult = await similarityMetric.EvaluateAsync(context);

if (!quickResult.Passed)
{
    // Very low similarity - likely wrong, skip expensive eval
    return TestResult.Fail(quickResult.Explanation);
}

// Passes quick check - run full LLM evaluation
var faithfulnessMetric = new FaithfulnessMetric(chatClient);
var fullResult = await faithfulnessMetric.EvaluateAsync(context);
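
The gating logic itself is independent of any particular metric. The sketch below isolates it using plain delegates so the control flow can be tested without network calls. `EvaluateWithGateAsync`, the delegate shapes, and the sample scores are hypothetical and not part of AgentEval.

```csharp
using System;
using System.Threading.Tasks;

// Runs the cheap check first and only invokes the expensive one if the gate passes.
static async Task<(bool Passed, string Stage, double Score)> EvaluateWithGateAsync(
    Func<Task<double>> cheapScore,
    Func<Task<double>> expensiveScore,
    double gateThreshold,
    double finalThreshold)
{
    var quick = await cheapScore();
    if (quick < gateThreshold)
    {
        return (false, "embedding", quick); // Short-circuit: skip the expensive evaluation.
    }

    var full = await expensiveScore();
    return (full >= finalThreshold, "llm", full);
}

// Stand-in for an LLM-based metric that records how often it is invoked.
var expensiveCalls = 0;
Task<double> FakeLlmScore()
{
    expensiveCalls++;
    return Task.FromResult(88.0);
}

// A low similarity score fails at the gate, so the LLM stand-in is never called.
var rejected = await EvaluateWithGateAsync(() => Task.FromResult(35.0), FakeLlmScore, 60, 75);

// A high similarity score passes the gate and proceeds to the LLM stand-in.
var accepted = await EvaluateWithGateAsync(() => Task.FromResult(91.0), FakeLlmScore, 60, 75);

Console.WriteLine($"rejected at {rejected.Stage}; accepted at {accepted.Stage}; LLM calls: {expensiveCalls}");
```

Keeping the gate threshold loose limits false rejections, since embedding similarity can understate answers that are correct but worded very differently from the ground truth.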
