# LLM-as-Judge Metrics

24 metrics that use an LLM judge to evaluate outputs. Requires an API key or a local Ollama model.
## RAG & Context Metrics

### hallucination

Checks if the output contains claims not supported by the context.

### faithfulness

Checks if the answer is faithful to the provided context (RAG-specific).

### context_relevance

Checks if the retrieved context is relevant to the query.

### answer_completeness

Checks if the answer fully addresses the query.

### groundedness

Claim-by-claim verification against multiple sources.

### contextual_precision

Checks if the most relevant documents are ranked higher.

### contextual_recall

Checks if all claims in the ground truth are supported by the context.
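The claim-based metrics above (groundedness, contextual_recall) share one pattern: decompose the text into claims and ask a judge whether each is supported. A minimal, library-independent sketch of that idea, with the LLM judge stubbed by a keyword check (names like `groundedness_score` and `stub_judge` are illustrative, not this library's API):

```python
from typing import Callable

def groundedness_score(answer: str, context: str,
                       judge: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims the judge finds supported by the context."""
    # Naive claim splitting on sentence boundaries; real metrics
    # typically use an LLM to extract claims.
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if not claims:
        return 0.0
    supported = sum(judge(claim, context) for claim in claims)
    return supported / len(claims)

# Stub judge: a real implementation would prompt an LLM with the
# claim and the context and parse a yes/no verdict.
def stub_judge(claim: str, context: str) -> bool:
    return all(word.lower() in context.lower() for word in claim.split())

context = "Paris is the capital of France."
answer = "Paris is the capital of France. It has ten moons."
print(groundedness_score(answer, context, stub_judge))  # 0.5
```

Swapping `stub_judge` for an LLM call turns this into the hallucination/groundedness family; the scoring loop stays the same.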
## Quality Metrics

### relevance

Query-output relevance scoring.

### fluency

Writing quality and naturalness.

### coherence

Logical structure and consistency.

### correctness

Semantic comparison to the expected output.

### consistency

Output consistency across multiple runs.

### instruction_following

Compliance with format, style, and constraint instructions.

### summarization

Summary accuracy, conciseness, and retention.
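The `consistency` metric compares several generations of the same prompt. A hedged sketch of the underlying idea as pairwise agreement with a pluggable judge (exact match stands in here for the LLM's semantic comparison; `consistency_score` is illustrative, not this library's API):

```python
from itertools import combinations
from typing import Callable, Sequence

def consistency_score(outputs: Sequence[str],
                      agree: Callable[[str, str], bool]) -> float:
    """Fraction of output pairs the judge considers in agreement."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output is trivially consistent
    return sum(agree(a, b) for a, b in pairs) / len(pairs)

runs = ["Paris", "Paris", "Lyon"]
# 1 of 3 pairs agree under exact match
print(consistency_score(runs, lambda a, b: a == b))
```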
## Safety Metrics

### toxicity

Harmful content detection.

### bias

Demographic, cultural, gender, and racial bias detection.

### sentiment

Tone and mood assessment.
## Custom Evaluation

### rubric

Evaluate against custom criteria.

### g_eval

Chain-of-thought evaluation with custom criteria and steps. Uses LLM reasoning to score outputs step by step.
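A sketch of the G-Eval flow: the judge prompt embeds the criteria plus explicit evaluation steps, and a numeric score is parsed from the judge's reply. The prompt wording, `g_eval_prompt`, and `parse_score` are illustrative assumptions, not the library's actual implementation:

```python
def g_eval_prompt(output: str, criteria: str, steps: list[str]) -> str:
    """Build a judge prompt with criteria and numbered reasoning steps."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    return (
        f"Evaluate the output against: {criteria}\n"
        f"Follow these steps:\n{numbered}\n"
        f"Output:\n{output}\n"
        "Reason step by step, then give a score from 1 to 5."
    )

def parse_score(judge_reply: str) -> int:
    # Take the last bare integer in the reply as the score.
    digits = [int(tok) for tok in judge_reply.split() if tok.isdigit()]
    return digits[-1]

reply = "Step 1: clear structure. Step 2: concise. Score: 4"
print(parse_score(reply))  # 4
```

Production implementations usually constrain the judge to a structured output format rather than scraping free text, but the shape of the loop is the same.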
## Agent & Conversation Metrics

### task_completion

Goal accomplishment check.

```python
check.task_completion(output, task_description="Search the database", constraints=["under 5 seconds"])
```

### role_adherence

Persona consistency.

### tool_accuracy

Agent tool selection evaluation.

### knowledge_retention

Multi-turn conversation memory.

### conversation_completeness

Multi-turn request fulfillment.
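knowledge_retention can be pictured as checking facts stated earlier in the conversation against the latest reply. A toy sketch with a stubbed judge (substring matching stands in for the LLM; `retention_score` and the fact list are hypothetical, not this library's API):

```python
from typing import Callable, Sequence

def retention_score(stated_facts: Sequence[str], reply: str,
                    judge: Callable[[str, str], bool]) -> float:
    """Fraction of earlier facts the judge finds reflected in the reply."""
    if not stated_facts:
        return 1.0
    return sum(judge(f, reply) for f in stated_facts) / len(stated_facts)

facts = ["the user's name is Ada", "the user is vegetarian"]
reply = "Sure Ada, here is a vegetarian lasagna recipe."
# Stub judge: does the fact's last word appear in the reply?
stub = lambda fact, text: fact.split()[-1].lower() in text.lower()
print(retention_score(facts, reply, stub))  # 1.0
```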