
LLM-as-Judge Metrics

24 metrics that use an LLM as a judge to evaluate outputs. These metrics require an LLM backend: an API key for a hosted provider, or a local Ollama instance.

RAG & Context Metrics

hallucination

Checks if the output contains claims not supported by the context.

check.hallucination(output, context="source text")

faithfulness

Checks if the answer is faithful to the provided context (RAG-specific).

check.faithfulness(output, context="source text", query="user question")

context_relevance

Checks if the retrieved context is relevant to the query.

check.context_relevance(context="retrieved text", query="user question")

answer_completeness

Checks if the answer fully addresses the query.

check.answer_completeness(output, query="user question")

groundedness

Verifies each claim in the output against multiple sources, claim by claim.

check.groundedness(output, sources=["source1", "source2"])
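To illustrate the claim-by-claim idea, here is a toy stand-in (not the library's implementation, which uses an LLM judge rather than string matching): split the output into sentences and count how many are found in at least one source.

```python
# Toy illustration of claim-by-claim grounding. The real metric asks an
# LLM whether each claim is supported; substring matching is only a sketch.

def toy_groundedness(output: str, sources: list[str]) -> float:
    """Fraction of output sentences found verbatim in any source."""
    claims = [c.strip() for c in output.split(".") if c.strip()]
    if not claims:
        return 0.0
    pool = " ".join(sources).lower()
    supported = sum(1 for claim in claims if claim.lower() in pool)
    return supported / len(claims)

score = toy_groundedness(
    "Paris is in France. The moon is cheese.",
    ["Paris is in France, on the Seine."],
)
# 1 of 2 claims supported -> 0.5
```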

contextual_precision

Checks if the most relevant documents are ranked higher.

check.contextual_precision(output, context=["doc1", "doc2"], query="q", expected="answer")
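One common way to score ranking quality (an assumed formulation, not necessarily the exact formula this metric uses) is to label each retrieved document relevant or not, then average precision@k over the ranks where relevant documents appear. A sketch:

```python
# Sketch of a rank-weighted contextual-precision score: relevant docs
# ranked near the top contribute more than relevant docs ranked low.

def toy_contextual_precision(relevant: list[bool]) -> float:
    """relevant[i] = whether the document at rank i+1 was relevant."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0

# Relevant docs at ranks 1 and 3: (1/1 + 2/3) / 2 ~= 0.833
toy_contextual_precision([True, False, True])
```

Swapping the relevant document from rank 3 to rank 2 raises the score, which is the point: the metric rewards putting relevant context first.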

contextual_recall

Checks if all claims in the ground truth are supported by context.

check.contextual_recall(output, context=["doc1", "doc2"], expected="ground truth")

Quality Metrics

relevance

Scores how relevant the output is to the query.

check.relevance(output, query="user question")

fluency

Writing quality and naturalness.

check.fluency(output)

coherence

Logical structure and consistency.

check.coherence(output)

correctness

Semantic comparison to expected output.

check.correctness(output, expected="expected answer")

consistency

Consistency of outputs across multiple runs.

check.consistency([output1, output2, output3])
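As a rough picture of what a consistency score measures, the sketch below averages pairwise word overlap (Jaccard similarity) across runs. This is only an illustration; the actual metric judges semantic agreement with an LLM, so paraphrases that share few words can still score as consistent.

```python
from itertools import combinations

# Toy consistency score: mean pairwise Jaccard word overlap across runs.

def toy_consistency(outputs: list[str]) -> float:
    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 1.0
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Two identical runs and one divergent run pull the average down:
toy_consistency(["the sky is blue", "the sky is blue", "grass is green"])
```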

instruction_following

Compliance with format, style, and constraint instructions.

check.instruction_following(output, instructions="Respond in bullet points under 100 words")

summarization

Summary accuracy, conciseness, and retention.

check.summarization(output, source="original text")

Safety Metrics

toxicity

Harmful content detection.

check.toxicity(output)

bias

Demographic, cultural, gender, and racial bias detection.

check.bias(output, categories=["gender", "racial"])

sentiment

Tone and mood assessment.

check.sentiment(output, threshold=0.6)

Custom Evaluation

rubric

Evaluate against custom criteria.

check.rubric(output, criteria="concise, mentions key findings, uses formal tone")

g_eval

Chain-of-thought evaluation with custom criteria and steps. Uses LLM reasoning to score outputs step by step.

check.g_eval(output, criteria="accuracy", steps=["Check facts", "Verify claims"])
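To make the criteria/steps mechanism concrete, here is a sketch of how a G-Eval-style evaluation prompt might be assembled from those arguments. The library's actual prompt template is not documented here, so treat this as illustrative only.

```python
# Illustrative G-Eval-style prompt assembly (hypothetical template):
# the criteria become the goal, the steps become numbered instructions
# the judge model walks through before scoring.

def build_geval_prompt(output: str, criteria: str, steps: list[str]) -> str:
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return (
        f"Evaluate the text below for: {criteria}\n"
        f"Follow these steps:\n{numbered}\n"
        f"Text:\n{output}\n"
        "Give a score from 1 to 10 with your reasoning."
    )

prompt = build_geval_prompt(
    "The treatment reduced symptoms by 40%.",
    criteria="accuracy",
    steps=["Check facts", "Verify claims"],
)
```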

Agent & Conversation Metrics

task_completion

Checks whether the agent accomplished the stated task, optionally subject to constraints.

check.task_completion(output, task_description="Search the database", constraints=["under 5 seconds"])

role_adherence

Checks that the output stays in character for the described role.

check.role_adherence(output, role_description="friendly customer support agent")

tool_accuracy

Checks whether the agent selected the expected tools for the query.

check.tool_accuracy(output, expected_tools=[{"name": "search"}], query="find records")

knowledge_retention

Checks whether key facts from earlier turns are retained across a multi-turn conversation.

check.knowledge_retention(output, conversation=[...], key_facts=["user name is Alice"])
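A toy version of the retention idea (illustrative only; the real metric uses an LLM judge rather than substring matching): what fraction of the key facts are reflected in the assistant's replies.

```python
# Toy knowledge-retention check over a conversation in
# [{"role": ..., "content": ...}] form.

def toy_retention(conversation: list[dict], key_facts: list[str]) -> float:
    """Fraction of key facts mentioned in any assistant turn."""
    replies = " ".join(
        turn["content"].lower()
        for turn in conversation
        if turn["role"] == "assistant"
    )
    kept = sum(1 for fact in key_facts if fact.lower() in replies)
    return kept / len(key_facts) if key_facts else 1.0

score = toy_retention(
    [
        {"role": "user", "content": "Hi, my name is Alice."},
        {"role": "assistant", "content": "Nice to meet you, Alice!"},
    ],
    key_facts=["Alice"],
)
# "Alice" appears in an assistant turn -> 1.0
```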

conversation_completeness

Checks whether all user requests across a multi-turn conversation were fulfilled.

check.conversation_completeness(output, conversation=[...], requirements=["answered all questions"])