# Embedding-Based Metrics

Fast, cost-effective semantic similarity evaluation without LLM calls.
## Overview
AgentEval includes embedding-based metrics that compute semantic similarity using vector embeddings. These metrics are significantly faster and cheaper than LLM-based evaluation, making them ideal for:
- High-volume test suites
- Continuous integration pipelines
- Quick feedback during development
- Baseline comparison before more expensive LLM evaluation
## Available Metrics
| Metric | Compares | Use Case |
|---|---|---|
| `AnswerSimilarityMetric` | Response ↔ Ground Truth | Is the answer semantically correct? |
| `ResponseContextSimilarityMetric` | Response ↔ Context | Is the response grounded in context? |
| `QueryContextSimilarityMetric` | Query ↔ Context | Is the retrieved context relevant? |
## How It Works
All embedding metrics follow the same pattern:
- Generate embeddings for two pieces of text
- Compute cosine similarity between the vectors (range: 0-1)
- Convert to a 0-100 score using `ScoreNormalizer.FromSimilarity()`
- Compare against a threshold to determine pass/fail
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────┐
│   Text A    │───▶│  Embedding  │───▶│   Cosine    │───▶│  Score  │
│   Text B    │───▶│  Generator  │───▶│ Similarity  │───▶│  0-100  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────┘
```
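In code, the pattern looks like this (a minimal sketch using the APIs described on this page; `embeddings` is any `IAgentEvalEmbeddings` implementation):

```csharp
// Step 1: embed the two texts
var a = await embeddings.GetEmbeddingAsync("Paris is the capital of France.");
var b = await embeddings.GetEmbeddingAsync("The capital of France is Paris.");

// Step 2: cosine similarity between the vectors (typically 0-1 for text)
float cosine = EmbeddingSimilarity.CosineSimilarity(a.Span, b.Span);

// Steps 3-4: normalize to 0-100, then compare against the metric's threshold
double score = ScoreNormalizer.FromSimilarity(cosine);
bool passed = score >= 70; // 70 is the default for AnswerSimilarityMetric
```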
## AnswerSimilarityMetric
Measures semantic similarity between the agent's response and the expected ground truth answer.
### When to Use
- You have ground truth answers for your test cases
- Speed and cost are priorities
- Semantic meaning matters more than exact wording
### Limitations
- Cannot detect subtle factual errors (e.g., wrong numbers; see the sketch below)
- May miss negation differences ("is" vs "is not")
- Requires ground truth to be available
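For example, embeddings treat two sentences that differ only in an amount as nearly identical, so a high similarity score can hide a wrong number. One mitigation (a hedged sketch; this helper is not part of AgentEval) is to pair the metric with an exact-match check on numeric values:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

var groundTruth = "The invoice total is $1,500.";
var output = "The invoice total is $1,200.";

// Hypothetical guard: every number in the ground truth must appear verbatim
bool numbersMatch = Regex.Matches(groundTruth, @"\d[\d,.]*")
    .All(m => output.Contains(m.Value));
// numbersMatch == false here, catching what a similarity score would miss
```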
### Example
```csharp
using AgentEval.Embeddings;
using AgentEval.Metrics.RAG;
using AgentEval.Core;
// Create an embedding provider (an IAgentEvalEmbeddings implementation; see below)
var embeddings = new OpenAIEmbeddings("text-embedding-3-small");
// Create metric with default threshold (70)
var metric = new AnswerSimilarityMetric(embeddings);
// Or with custom threshold
var strictMetric = new AnswerSimilarityMetric(embeddings, passingThreshold: 85);
// Evaluate
var context = new EvaluationContext
{
Input = "What is the capital of France?",
Output = "Paris is the capital of France.",
GroundTruth = "The capital of France is Paris."
};
var result = await metric.EvaluateAsync(context);
Console.WriteLine($"Score: {result.Score}"); // e.g., 92.5
Console.WriteLine($"Passed: {result.Passed}"); // true
Console.WriteLine($"Similarity: {result.Details["cosineSimilarity"]}"); // e.g., 0.925
### Result Details
| Property | Description |
|---|---|
| `cosineSimilarity` | Raw cosine similarity (0-1) |
| `interpretation` | Human-readable interpretation ("Excellent", "Good", etc.) |
| `groundTruthLength` | Character length of the ground truth |
| `outputLength` | Character length of the response |
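Assuming `Details` is a string-keyed dictionary of boxed values (as the examples on this page suggest), the entries can be read back directly:

```csharp
// Raw similarity and its interpretation, pulled from the details dictionary
var rawSimilarity = Convert.ToDouble(result.Details["cosineSimilarity"]);
var interpretation = (string)result.Details["interpretation"];
Console.WriteLine($"{interpretation}: {rawSimilarity:F3}");
```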
## ResponseContextSimilarityMetric
Measures how semantically similar the agent's response is to the retrieved context. This is a fast grounding check.
### When to Use
- Checking if RAG responses use the provided context
- Quick grounding validation without LLM calls
- Detecting when responses hallucinate beyond context
### Default Threshold
Uses a lower default threshold (50) because responses may legitimately extend beyond the context with reasoning and formatting.
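If your responses should stay close to the source material, raise the threshold; this sketch assumes the constructor mirrors the `passingThreshold` pattern shown for `AnswerSimilarityMetric` above:

```csharp
// Default: lenient threshold (50) tolerates reasoning beyond the context
var lenient = new ResponseContextSimilarityMetric(embeddings);

// Stricter grounding check for extractive, quote-heavy answers
var strict = new ResponseContextSimilarityMetric(embeddings, passingThreshold: 70);
```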
### Example
```csharp
var metric = new ResponseContextSimilarityMetric(embeddings);
var context = new EvaluationContext
{
Input = "What products does Contoso sell?",
Context = "Contoso Corporation is a leading manufacturer of widgets and gadgets. They offer three product lines: basic widgets, premium gadgets, and enterprise solutions.",
Output = "Contoso sells widgets and gadgets across three product lines."
};
var result = await metric.EvaluateAsync(context);
if (result.Passed)
{
Console.WriteLine("Response is grounded in context");
}
else
{
Console.WriteLine($"Warning: Response may contain hallucinations. Similarity: {result.Details["cosineSimilarity"]:P1}");
}
```
## QueryContextSimilarityMetric
Measures how relevant the retrieved context is to the user's query. Useful for evaluating retrieval quality without running the full RAG pipeline.
### When to Use
- Evaluating retriever quality
- Debugging poor RAG responses (is the context even relevant?)
- Retrieval system benchmarks
### Example
```csharp
var metric = new QueryContextSimilarityMetric(embeddings);
var context = new EvaluationContext
{
Input = "How do I reset my password?",
Context = "To reset your password, navigate to Settings > Security > Password Reset. Click 'Reset Password' and follow the instructions sent to your email.",
Output = "" // Not needed for this metric
};
var result = await metric.EvaluateAsync(context);
Console.WriteLine($"Context relevance: {result.Score:F1}%");
## EmbeddingSimilarity Utilities
The `EmbeddingSimilarity` static class provides low-level utilities for working with embeddings:
```csharp
using AgentEval.Embeddings;
// Get embeddings from your provider
var embedding1 = await embeddings.GetEmbeddingAsync("Hello world");
var embedding2 = await embeddings.GetEmbeddingAsync("Hi there");
// Cosine similarity (most common)
float similarity = EmbeddingSimilarity.CosineSimilarity(
embedding1.Span,
embedding2.Span);
// Range: [-1, 1], typically [0, 1] for text
// Euclidean distance
float distance = EmbeddingSimilarity.EuclideanDistance(
embedding1.Span,
embedding2.Span);
// Range: [0, ∞), 0 = identical
// Dot product
float dotProduct = EmbeddingSimilarity.DotProduct(
embedding1.Span,
embedding2.Span);
// Unnormalized similarity measure
```
### Batch Operations
```csharp
// Compare query against multiple candidates
var candidates = new List<ReadOnlyMemory<float>>
{
doc1Embedding,
doc2Embedding,
doc3Embedding
};
float[] similarities = EmbeddingSimilarity.ComputeSimilarities(
queryEmbedding,
candidates);
// Find top-k most similar items
var documents = new[]
{
(doc1, doc1Embedding),
(doc2, doc2Embedding),
(doc3, doc3Embedding)
};
var topResults = EmbeddingSimilarity.TopK(queryEmbedding, documents, topK: 2);
foreach (var (doc, similarity) in topResults)
{
Console.WriteLine($"{doc.Title}: {similarity:P1}");
}
```
## Implementing IAgentEvalEmbeddings
To use embedding metrics, implement the `IAgentEvalEmbeddings` interface:
```csharp
public interface IAgentEvalEmbeddings
{
/// <summary>
/// Get embeddings for a single text.
/// </summary>
Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
string text,
CancellationToken cancellationToken = default);
/// <summary>
/// Get embeddings for multiple texts (batch).
/// </summary>
Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
IEnumerable<string> texts,
CancellationToken cancellationToken = default);
}
```
### OpenAI Example
```csharp
using OpenAI.Embeddings;

// Uses the official OpenAI .NET SDK's EmbeddingClient;
// assumes the OPENAI_API_KEY environment variable is set.
public class OpenAIEmbeddings : IAgentEvalEmbeddings
{
    private readonly EmbeddingClient _client;

    public OpenAIEmbeddings(string model = "text-embedding-3-small")
    {
        var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")!;
        _client = new EmbeddingClient(model, apiKey);
    }

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text,
        CancellationToken ct = default)
    {
        var response = await _client.GenerateEmbeddingAsync(text, cancellationToken: ct);
        return response.Value.ToFloats();
    }

    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts,
        CancellationToken ct = default)
    {
        var response = await _client.GenerateEmbeddingsAsync(texts.ToList(), cancellationToken: ct);
        return response.Value.Select(e => e.ToFloats()).ToList();
    }
}
```
### Azure AI Example
```csharp
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Embeddings;

// A sketch using the Azure OpenAI client (Azure.AI.OpenAI 2.x); adapt to
// your own endpoint, key management, and deployment names.
public class AzureEmbeddings : IAgentEvalEmbeddings
{
    private readonly EmbeddingClient _client;

    public AzureEmbeddings(string endpoint, string key, string deploymentName)
    {
        var azureClient = new AzureOpenAIClient(new Uri(endpoint), new AzureKeyCredential(key));
        _client = azureClient.GetEmbeddingClient(deploymentName);
    }

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text,
        CancellationToken ct = default)
    {
        var response = await _client.GenerateEmbeddingAsync(text, cancellationToken: ct);
        return response.Value.ToFloats();
    }

    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts,
        CancellationToken ct = default)
    {
        var response = await _client.GenerateEmbeddingsAsync(texts.ToList(), cancellationToken: ct);
        return response.Value.Select(e => e.ToFloats()).ToList();
    }
}
```
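Construction then looks like this (hypothetical endpoint and key; substitute your own resource values):

```csharp
var embeddings = new AzureEmbeddings(
    "https://my-resource.openai.azure.com",
    Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!,
    "text-embedding-3-small");
```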
## Score Interpretation
Embedding similarity scores are normalized to a 0-100 scale:
| Score Range | Interpretation | Typical Meaning |
|---|---|---|
| 90-100 | Excellent | Nearly identical meaning |
| 80-89 | Good | Very similar meaning |
| 70-79 | Acceptable | Related meaning |
| 60-69 | Marginal | Somewhat related |
| 50-59 | Poor | Weakly related |
| 0-49 | Fail | Different topics |
The `ScoreNormalizer` class handles this conversion:
```csharp
// Convert cosine similarity (0-1) to score (0-100)
double score = ScoreNormalizer.FromSimilarity(0.85); // Returns 85.0
// Get human-readable interpretation
string interpretation = ScoreNormalizer.Interpret(85); // Returns "Good"
```
## Performance Considerations

### Speed
Embedding metrics are 10-100x faster than LLM-based metrics:
| Metric Type | Typical Latency | Cost per Eval |
|---|---|---|
| LLM-based (GPT-4) | 1-5 seconds | $0.01-0.05 |
| Embedding-based | 50-200ms | $0.0001 |
### Batching
For best performance, batch embedding requests:
```csharp
// Bad: Sequential calls
foreach (var testCase in testCases)
{
var embedding = await embeddings.GetEmbeddingAsync(testCase.Output);
}
// Good: Batch call
var outputs = testCases.Select(tc => tc.Output);
var allEmbeddings = await embeddings.GetEmbeddingsAsync(outputs);
```
### Caching
Consider caching embeddings for static content:
```csharp
public class CachingEmbeddings : IAgentEvalEmbeddings
{
    private readonly IAgentEvalEmbeddings _inner;
    private readonly Dictionary<string, ReadOnlyMemory<float>> _cache = new();

    public CachingEmbeddings(IAgentEvalEmbeddings inner) => _inner = inner;

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(string text, CancellationToken ct = default)
    {
        if (_cache.TryGetValue(text, out var cached))
            return cached;
        var embedding = await _inner.GetEmbeddingAsync(text, ct);
        _cache[text] = embedding;
        return embedding;
    }

    // Batch overload required by the interface; reuses the cached single-text path
    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, CancellationToken ct = default)
    {
        var results = new List<ReadOnlyMemory<float>>();
        foreach (var text in texts)
            results.Add(await GetEmbeddingAsync(text, ct));
        return results;
    }
}
```
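The cache is a drop-in decorator: wrap any provider and pass it to the metrics as usual.

```csharp
// Repeated ground-truth texts are embedded only once per run
var cachedEmbeddings = new CachingEmbeddings(new OpenAIEmbeddings());
var metric = new AnswerSimilarityMetric(cachedEmbeddings);
```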
## Combining with LLM Metrics
Embedding metrics work well as a first-pass filter before expensive LLM evaluation:
```csharp
// Quick embedding check first
var similarityMetric = new AnswerSimilarityMetric(embeddings, passingThreshold: 60);
var quickResult = await similarityMetric.EvaluateAsync(context);
if (!quickResult.Passed)
{
// Very low similarity - likely wrong, skip expensive eval
return TestResult.Fail(quickResult.Explanation);
}
// Passes quick check - run full LLM evaluation
var faithfulnessMetric = new FaithfulnessMetric(chatClient);
var fullResult = await faithfulnessMetric.EvaluateAsync(context);
```
## See Also
- Architecture Overview - Understanding the metric hierarchy
- Extensibility Guide - Creating custom metrics
- Benchmarks Guide - Performance testing