
LLM-as-Judge Metrics

24 metrics that use an LLM as a judge to evaluate outputs. These metrics require an LLM backend: an API key for a hosted provider, or a local Ollama instance.

RAG & Context Metrics

hallucination

Checks if the output contains claims not supported by the context.

check.hallucination(output, context="source text")

faithfulness

Checks if the answer is faithful to the provided context (RAG-specific).

check.faithfulness(output, context="source text", query="user question")

context_relevance

Checks if the retrieved context is relevant to the query.

check.context_relevance(context="retrieved text", query="user question")

answer_completeness

Checks if the answer fully addresses the query.

check.answer_completeness(output, query="user question")

groundedness

Verifies each claim in the output against multiple sources, claim by claim.

check.groundedness(output, sources=["source1", "source2"])
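To illustrate the claim-by-claim idea, here is a toy stand-in (not the library's implementation, which uses an LLM judge rather than string matching): split the output into sentences and count how many are found in at least one source.

```python
# Toy illustration of claim-by-claim grounding. The real metric asks an
# LLM whether each claim is supported; substring matching is only a sketch.

def toy_groundedness(output: str, sources: list[str]) -> float:
    """Fraction of output sentences found verbatim in any source."""
    claims = [c.strip() for c in output.split(".") if c.strip()]
    if not claims:
        return 0.0
    pool = " ".join(sources).lower()
    supported = sum(1 for claim in claims if claim.lower() in pool)
    return supported / len(claims)

score = toy_groundedness(
    "Paris is in France. The moon is cheese.",
    ["Paris is in France, on the Seine."],
)
# 1 of 2 claims supported -> 0.5
```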

contextual_precision

Checks if the most relevant documents are ranked higher.

check.contextual_precision(output, context=["doc1", "doc2"], query="q", expected="answer")
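One common way to score ranking quality (an assumed formulation, not necessarily the exact formula this metric uses) is to label each retrieved document relevant or not, then average precision@k over the ranks where relevant documents appear. A sketch:

```python
# Sketch of a rank-weighted contextual-precision score: relevant docs
# ranked near the top contribute more than relevant docs ranked low.

def toy_contextual_precision(relevant: list[bool]) -> float:
    """relevant[i] = whether the document at rank i+1 was relevant."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0

# Relevant docs at ranks 1 and 3: (1/1 + 2/3) / 2 ~= 0.833
toy_contextual_precision([True, False, True])
```

Swapping the relevant document from rank 3 to rank 2 raises the score, which is the point: the metric rewards putting relevant context first.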

contextual_recall

Checks if all claims in the ground truth are supported by context.

check.contextual_recall(output, context=["doc1", "doc2"], expected="ground truth")

Quality Metrics

relevance

Scores how relevant the output is to the query.

check.relevance(output, query="user question")

fluency

Writing quality and naturalness.

check.fluency(output)

coherence

Logical structure and consistency.

check.coherence(output)

correctness

Semantic comparison to expected output.

check.correctness(output, expected="expected answer")

consistency

Consistency of outputs across multiple runs.

check.consistency([output1, output2, output3])
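As a rough picture of what a consistency score measures, the sketch below averages pairwise word overlap (Jaccard similarity) across runs. This is only an illustration; the actual metric judges semantic agreement with an LLM, so paraphrases that share few words can still score as consistent.

```python
from itertools import combinations

# Toy consistency score: mean pairwise Jaccard word overlap across runs.

def toy_consistency(outputs: list[str]) -> float:
    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 1.0
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Two identical runs and one divergent run pull the average down:
toy_consistency(["the sky is blue", "the sky is blue", "grass is green"])
```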

instruction_following

Compliance with format, style, and constraint instructions.

check.instruction_following(output, instructions="Respond in bullet points under 100 words")

summarization

Summary accuracy, conciseness, and retention.

check.summarization(output, source="original text")

Safety Metrics

toxicity

Harmful content detection.

check.toxicity(output)

bias

Demographic, cultural, gender, and racial bias detection.

check.bias(output, categories=["gender", "racial"])

sentiment

Tone and mood assessment.

check.sentiment(output, threshold=0.6)

Custom Evaluation

rubric

Evaluate against custom criteria.

check.rubric(output, criteria="concise, mentions key findings, uses formal tone")

g_eval

Chain-of-thought evaluation with custom criteria and steps. Uses LLM reasoning to score outputs step by step.

check.g_eval(output, criteria="accuracy", steps=["Check facts", "Verify claims"])
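To make the criteria/steps mechanism concrete, here is a sketch of how a G-Eval-style evaluation prompt might be assembled from those arguments. The library's actual prompt template is not documented here, so treat this as illustrative only.

```python
# Illustrative G-Eval-style prompt assembly (hypothetical template):
# the criteria become the goal, the steps become numbered instructions
# the judge model walks through before scoring.

def build_geval_prompt(output: str, criteria: str, steps: list[str]) -> str:
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return (
        f"Evaluate the text below for: {criteria}\n"
        f"Follow these steps:\n{numbered}\n"
        f"Text:\n{output}\n"
        "Give a score from 1 to 10 with your reasoning."
    )

prompt = build_geval_prompt(
    "The treatment reduced symptoms by 40%.",
    criteria="accuracy",
    steps=["Check facts", "Verify claims"],
)
```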

Agent & Conversation Metrics

task_completion

Checks whether the agent accomplished the stated task, optionally subject to constraints.

check.task_completion(output, task_description="Search the database", constraints=["under 5 seconds"])

role_adherence

Checks that the output stays in character for the described role.

check.role_adherence(output, role_description="friendly customer support agent")

tool_accuracy

Checks whether the agent selected the expected tools for the query.

check.tool_accuracy(output, expected_tools=[{"name": "search"}], query="find records")

knowledge_retention

Checks whether key facts from earlier turns are retained across a multi-turn conversation.

check.knowledge_retention(output, conversation=[...], key_facts=["user name is Alice"])
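A toy version of the retention idea (illustrative only; the real metric uses an LLM judge rather than substring matching): what fraction of the key facts are reflected in the assistant's replies.

```python
# Toy knowledge-retention check over a conversation in
# [{"role": ..., "content": ...}] form.

def toy_retention(conversation: list[dict], key_facts: list[str]) -> float:
    """Fraction of key facts mentioned in any assistant turn."""
    replies = " ".join(
        turn["content"].lower()
        for turn in conversation
        if turn["role"] == "assistant"
    )
    kept = sum(1 for fact in key_facts if fact.lower() in replies)
    return kept / len(key_facts) if key_facts else 1.0

score = toy_retention(
    [
        {"role": "user", "content": "Hi, my name is Alice."},
        {"role": "assistant", "content": "Nice to meet you, Alice!"},
    ],
    key_facts=["Alice"],
)
# "Alice" appears in an assistant turn -> 1.0
```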

conversation_completeness

Checks whether all user requests across a multi-turn conversation were fulfilled.

check.conversation_completeness(output, conversation=[...], requirements=["answered all questions"])