Implementing Embeddings

Technical guide for implementing IAgentEvalEmbeddings and using embedding utilities.

Looking for guidance on when to use embedding metrics? See the RAG Metrics Guide.

Want to see embedding metrics in action? Check out Sample 12: Embedding Metrics for semantic similarity evaluation examples.
Overview
AgentEval includes embedding-based metrics that compute semantic similarity using vector embeddings. These metrics are significantly faster and cheaper than LLM-based evaluation, making them ideal for:
- High-volume evaluations
- Continuous integration pipelines
- Quick feedback during development
- Baseline comparison before more expensive LLM evaluation
Available Metrics
| Metric | Compares | Use Case |
|---|---|---|
| AnswerSimilarityMetric | Response ↔ Ground Truth | Is the answer semantically correct? |
| ResponseContextSimilarityMetric | Response ↔ Context | Is the response grounded in context? |
| QueryContextSimilarityMetric | Query ↔ Context | Is the retrieved context relevant? |
How It Works
All embedding metrics follow the same pattern:
1. Generate embeddings for the two pieces of text
2. Compute cosine similarity between the vectors (range: 0-1)
3. Convert to a 0-100 score using ScoreNormalizer.FromSimilarity()
4. Compare against a threshold to determine pass/fail
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────┐
│   Text A    │───▶│  Embedding  │───▶│   Cosine    │───▶│  Score  │
│   Text B    │───▶│  Generator  │───▶│ Similarity  │───▶│  0-100  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────┘
```
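The pipeline can be sketched end-to-end. This is a language-agnostic illustration in Python rather than the library's C#; the embedding step is stubbed with fixed vectors, and `from_similarity` assumes a simple clamp-and-scale normalization (the actual ScoreNormalizer implementation may differ):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def from_similarity(similarity):
    # Assumed normalization: clamp to [0, 1], then scale to 0-100.
    return max(0.0, min(1.0, similarity)) * 100

# Stand-ins for real model embeddings of Text A and Text B.
vec_a = [0.9, 0.1, 0.4]
vec_b = [0.8, 0.2, 0.5]

similarity = cosine_similarity(vec_a, vec_b)   # step 2
score = from_similarity(similarity)            # step 3
passed = score >= 70                           # step 4: default threshold
```

Real embeddings have hundreds or thousands of dimensions, but the math is identical.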
AnswerSimilarityMetric
Measures semantic similarity between the agent's response and the expected ground truth answer.
When to Use
- You have ground truth answers for your test cases
- Speed and cost are priorities
- Semantic meaning matters more than exact wording
Limitations
- Cannot detect subtle factual errors (e.g., wrong numbers)
- May miss negation differences ("is" vs "is not")
- Requires ground truth to be available
Example
```csharp
using AgentEval.Embeddings;
using AgentEval.Metrics.RAG;
using AgentEval.Core;

// Create embedding generator (an IAgentEvalEmbeddings implementation)
var embeddings = new OpenAIEmbeddings("text-embedding-3-small");

// Create metric with default threshold (70)
var metric = new AnswerSimilarityMetric(embeddings);

// Or with custom threshold
var strictMetric = new AnswerSimilarityMetric(embeddings, passingThreshold: 85);

// Evaluate
var context = new EvaluationContext
{
    Input = "What is the capital of France?",
    Output = "Paris is the capital of France.",
    GroundTruth = "The capital of France is Paris."
};

var result = await metric.EvaluateAsync(context);

Console.WriteLine($"Score: {result.Score}");     // e.g., 92.5
Console.WriteLine($"Passed: {result.Passed}");   // true
Console.WriteLine($"Similarity: {result.Details["cosineSimilarity"]}"); // e.g., 0.925
```
📖 Complete Example: See Sample 12: Embedding Metrics for a complete implementation showing all three embedding metrics with real evaluation scenarios.
Result Details
| Property | Description |
|---|---|
| cosineSimilarity | Raw cosine similarity (0-1) |
| interpretation | Human-readable interpretation ("Excellent", "Good", etc.) |
| groundTruthLength | Character length of ground truth |
| outputLength | Character length of response |
ResponseContextSimilarityMetric
Measures how semantically similar the agent's response is to the retrieved context. This is a fast grounding check.
When to Use
- Checking if RAG responses use the provided context
- Quick grounding validation without LLM calls
- Detecting when responses hallucinate beyond context
Default Threshold
Uses a lower default threshold (50) because responses may legitimately extend beyond the context with reasoning and formatting.
Example
```csharp
var metric = new ResponseContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Input = "What products does Contoso sell?",
    Context = "Contoso Corporation is a leading manufacturer of widgets and gadgets. They offer three product lines: basic widgets, premium gadgets, and enterprise solutions.",
    Output = "Contoso sells widgets and gadgets across three product lines."
};

var result = await metric.EvaluateAsync(context);

if (result.Passed)
{
    Console.WriteLine("Response is grounded in context");
}
else
{
    Console.WriteLine($"Warning: Response may contain hallucinations. Similarity: {result.Details["cosineSimilarity"]:P1}");
}
```
QueryContextSimilarityMetric
Measures how relevant the retrieved context is to the user's query. Useful for evaluating retrieval quality without running the full RAG pipeline.
When to Use
- Evaluating retriever quality
- Debugging poor RAG responses (is the context even relevant?)
- Retrieval system benchmarks
Example
```csharp
var metric = new QueryContextSimilarityMetric(embeddings);

var context = new EvaluationContext
{
    Input = "How do I reset my password?",
    Context = "To reset your password, navigate to Settings > Security > Password Reset. Click 'Reset Password' and follow the instructions sent to your email.",
    Output = "" // Not needed for this metric
};

var result = await metric.EvaluateAsync(context);
Console.WriteLine($"Context relevance: {result.Score:F1}%");
```
EmbeddingSimilarity Utilities
The EmbeddingSimilarity static class provides low-level utilities for working with embeddings:
```csharp
using AgentEval.Embeddings;

// Get embeddings from your provider
var embedding1 = await embeddings.GetEmbeddingAsync("Hello world");
var embedding2 = await embeddings.GetEmbeddingAsync("Hi there");

// Cosine similarity (most common)
// Range: [-1, 1], typically [0, 1] for text
float similarity = EmbeddingSimilarity.CosineSimilarity(
    embedding1.Span,
    embedding2.Span);

// Euclidean distance
// Range: [0, ∞), 0 = identical
float distance = EmbeddingSimilarity.EuclideanDistance(
    embedding1.Span,
    embedding2.Span);

// Dot product (unnormalized similarity measure)
float dotProduct = EmbeddingSimilarity.DotProduct(
    embedding1.Span,
    embedding2.Span);
```
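The three measures have simple definitions. As a language-agnostic sketch (Python here, not the library's C#) on toy 2-D vectors, which makes their differing behavior easy to see:

```python
import math

def cosine_similarity(a, b):
    # Angle-based: magnitude of the vectors cancels out.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    # Straight-line distance between the vector endpoints.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    # Unnormalized: grows with both alignment and magnitude.
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]

cos_ab = cosine_similarity(a, b)    # 0.0: orthogonal vectors
cos_ac = cosine_similarity(a, c)    # 1.0: same direction, magnitude ignored
dist_aa = euclidean_distance(a, a)  # 0.0: identical vectors
dot_ac = dot_product(a, c)          # 2.0: scales with magnitude
```

Cosine similarity is the default for text because embedding magnitudes vary with text length while direction carries the meaning.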
Batch Operations
```csharp
// Compare query against multiple candidates
var candidates = new List<ReadOnlyMemory<float>>
{
    doc1Embedding,
    doc2Embedding,
    doc3Embedding
};

float[] similarities = EmbeddingSimilarity.ComputeSimilarities(
    queryEmbedding,
    candidates);

// Find top-k most similar items
var documents = new[]
{
    (doc1, doc1Embedding),
    (doc2, doc2Embedding),
    (doc3, doc3Embedding)
};

var topResults = EmbeddingSimilarity.TopK(queryEmbedding, documents, topK: 2);

foreach (var (doc, similarity) in topResults)
{
    Console.WriteLine($"{doc.Title}: {similarity:P1}");
}
```
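The top-k operation amounts to score-then-sort. Here is a hypothetical `top_k` helper in Python (not the library's C# API) mirroring the idea, with hand-picked 2-D embeddings rather than real model output:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, items, k):
    # items: (label, embedding) pairs; returns the k highest-similarity pairs.
    scored = [(label, cosine_similarity(query, emb)) for label, emb in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

query = [1.0, 0.0]
docs = [
    ("doc1", [0.9, 0.1]),
    ("doc2", [0.0, 1.0]),   # orthogonal to the query
    ("doc3", [0.7, 0.7]),
]

results = top_k(query, docs, k=2)
# doc1 ranks first, doc3 second; the orthogonal doc2 is dropped.
```

For large candidate sets, production systems replace the full sort with a vector index, but the ranking semantics are the same.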
Implementing IAgentEvalEmbeddings
To use embedding metrics, implement the IAgentEvalEmbeddings interface:
```csharp
public interface IAgentEvalEmbeddings
{
    /// <summary>
    /// Get embeddings for a single text.
    /// </summary>
    Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Get embeddings for multiple texts (batch).
    /// </summary>
    Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts,
        CancellationToken cancellationToken = default);
}
```
DI Registration
IAgentEvalEmbeddings is automatically registered in DI when you call services.AddAgentEval(), provided you register an IEmbeddingGenerator<string, Embedding<float>> first. It wraps the generator with MEAIEmbeddingAdapter. You can also register your own implementation before AddAgentEval() and it will be preserved (TryAdd semantics):
```csharp
services.AddSingleton<IEmbeddingGenerator<string, Embedding<float>>>(generator);
services.AddAgentEval(); // IAgentEvalEmbeddings now available via DI

// Or register your own:
services.AddSingleton<IAgentEvalEmbeddings>(new MyCustomEmbeddings());
services.AddAgentEval(); // Your registration wins
```
OpenAI Example
```csharp
public class OpenAIEmbeddings : IAgentEvalEmbeddings
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _generator;

    // Accepts any Microsoft.Extensions.AI embedding generator configured
    // for an OpenAI embedding model (e.g., text-embedding-3-small).
    public OpenAIEmbeddings(IEmbeddingGenerator<string, Embedding<float>> generator)
    {
        _generator = generator;
    }

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text,
        CancellationToken ct = default)
    {
        var result = await _generator.GenerateEmbeddingAsync(text, cancellationToken: ct);
        return result.Vector;
    }

    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts,
        CancellationToken ct = default)
    {
        var results = await _generator.GenerateEmbeddingsAsync(texts.ToList(), cancellationToken: ct);
        return results.Select(r => r.Vector).ToList();
    }
}
```
Azure AI Example
```csharp
public class AzureEmbeddings : IAgentEvalEmbeddings
{
    private readonly EmbeddingsClient _client;

    public AzureEmbeddings(string endpoint, string key, string deploymentName)
    {
        var credential = new AzureKeyCredential(key);
        _client = new EmbeddingsClient(new Uri(endpoint), credential, deploymentName);
    }

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text,
        CancellationToken ct = default)
    {
        var response = await _client.EmbedAsync(text, ct);
        return response.Value.Data[0].Embedding.ToArray();
    }

    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts,
        CancellationToken ct = default)
    {
        var response = await _client.EmbedAsync(texts.ToList(), ct);
        return response.Value.Data.Select(d => (ReadOnlyMemory<float>)d.Embedding.ToArray()).ToList();
    }
}
```
Score Interpretation
Embedding similarity scores are normalized to a 0-100 scale:
| Score Range | Interpretation | Typical Meaning |
|---|---|---|
| 90-100 | Excellent | Nearly identical meaning |
| 80-89 | Good | Very similar meaning |
| 70-79 | Acceptable | Related meaning |
| 60-69 | Marginal | Somewhat related |
| 50-59 | Poor | Weakly related |
| 0-49 | Fail | Different topics |
The ScoreNormalizer class handles this conversion:
```csharp
// Convert cosine similarity (0-1) to score (0-100)
double score = ScoreNormalizer.FromSimilarity(0.85); // Returns 85.0

// Get human-readable interpretation
string interpretation = ScoreNormalizer.Interpret(85); // Returns "Good"
```
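The conversion and banding can be sketched in a few lines. This Python illustration (the library is C#) takes the `interpret` bands from the table above and assumes `from_similarity` clamps to [0, 1] before scaling; the real ScoreNormalizer may handle edge cases differently:

```python
def from_similarity(similarity):
    # Assumed behavior: clamp cosine similarity to [0, 1], scale to 0-100.
    return max(0.0, min(1.0, similarity)) * 100

def interpret(score):
    # Bands taken from the score-interpretation table.
    if score >= 90: return "Excellent"
    if score >= 80: return "Good"
    if score >= 70: return "Acceptable"
    if score >= 60: return "Marginal"
    if score >= 50: return "Poor"
    return "Fail"

score = from_similarity(0.85)   # 85.0
label = interpret(score)        # "Good"
```

Clamping matters because cosine similarity can technically be negative; a negative similarity simply maps to a score of 0.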
Performance Considerations
Speed
Embedding metrics are 10-100x faster than LLM-based metrics:
| Metric Type | Typical Latency | Cost per Eval |
|---|---|---|
| LLM-based (GPT-4) | 1-5 seconds | $0.01-0.05 |
| Embedding-based | 50-200ms | $0.0001 |
💡 Performance Tip: Sample 12 demonstrates batch processing and caching techniques for maximum performance.
Batching
For best performance, batch embedding requests:
```csharp
// Bad: Sequential calls
foreach (var testCase in testCases)
{
    var embedding = await embeddings.GetEmbeddingAsync(testCase.Output);
}

// Good: Batch call
var outputs = testCases.Select(tc => tc.Output);
var allEmbeddings = await embeddings.GetEmbeddingsAsync(outputs);
```
Caching
Consider caching embeddings for static content:
```csharp
public class CachingEmbeddings : IAgentEvalEmbeddings
{
    private readonly IAgentEvalEmbeddings _inner;
    private readonly Dictionary<string, ReadOnlyMemory<float>> _cache = new();

    public CachingEmbeddings(IAgentEvalEmbeddings inner) => _inner = inner;

    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(
        string text, CancellationToken ct = default)
    {
        if (_cache.TryGetValue(text, out var cached))
            return cached;

        var embedding = await _inner.GetEmbeddingAsync(text, ct);
        _cache[text] = embedding;
        return embedding;
    }

    public async Task<IReadOnlyList<ReadOnlyMemory<float>>> GetEmbeddingsAsync(
        IEnumerable<string> texts, CancellationToken ct = default)
    {
        // Simple per-text lookup; a production cache would batch the misses.
        var results = new List<ReadOnlyMemory<float>>();
        foreach (var text in texts)
            results.Add(await GetEmbeddingAsync(text, ct));
        return results;
    }
}
```
Combining with LLM Metrics
Embedding metrics work well as a first-pass filter before expensive LLM evaluation:
```csharp
// Quick embedding check first
var similarityMetric = new AnswerSimilarityMetric(embeddings, passingThreshold: 60);
var quickResult = await similarityMetric.EvaluateAsync(context);

if (!quickResult.Passed)
{
    // Very low similarity - likely wrong, skip expensive eval
    return TestResult.Fail(quickResult.Explanation);
}

// Passes quick check - run full LLM evaluation
var faithfulnessMetric = new FaithfulnessMetric(chatClient);
var fullResult = await faithfulnessMetric.EvaluateAsync(context);
```
See Also
- Sample 12: Embedding Metrics - Complete working example
- Architecture Overview - Understanding the metric hierarchy
- Extensibility Guide - Creating custom metrics
- Benchmarks Guide - Performance evaluation