Choosing metrics¶
checkllm ships with ~70 LLM-as-judge metrics and ~40 deterministic checks. The number is a feature, not a trap: most scenarios need only 3-5 of them. This page helps you pick the right ones fast.
The one rule
Start with deterministic checks. Add one judge metric per concern. Stop. Most production setups need 2-4 deterministic guards and 2-3 judge metrics. If you are reaching for a seventh metric, you are probably measuring the same thing twice.
TL;DR: pick by use case¶
| Use case | Essential | Nice to have | Avoid / skip |
|---|---|---|---|
| RAG | `faithfulness`, `contextual_recall`, `contextual_precision`, `relevance` | `context_entity_recall`, `noise_sensitivity`, `citation_accuracy` | `fluency`, `coherence` (rarely the problem) |
| Agents / tool use | `tool_call_f1`, `tool_accuracy`, `task_completion` | `plan_adherence`, `argument_correctness`, `trajectory_step_count` | `bias`, `toxicity` (usually out of scope) |
| Chatbot / multi-turn | `conversation_completeness`, `role_adherence`, `knowledge_retention` | `turn_relevancy`, `topic_adherence` | `correctness` (no ground truth per turn) |
| Content generation | `rubric` or `g_eval`, `instruction_following` | `fluency`, `coherence`, `no_pii`, `readability` | `faithfulness` (no context) |
| Code generation | `code_correctness`, `is_valid_python` / `is_valid_sql`, functional tests | `rubric` for style, `sql_equivalence` for SQL | `fluency`, `coherence` |
| Image / multimodal | `image_text_alignment`, `multimodal_faithfulness`, `visual_hallucination` | `ocr_accuracy`, `chart_value_extraction` | Text-only metrics |
| Production guardrail | `no_pii`, `max_tokens`, `toxicity`, `is_refusal` | `misuse_detection`, `role_violation`, `non_advice` | Heavy judge metrics (latency) |
| Safety review | `toxicity`, `bias`, `pii_detection`, `misuse_detection` | `role_violation`, `non_advice` | n/a |
| Red team | `RedTeamer` + `toxicity`, `pii_detection`, `is_refusal` | `redteam_evolver` for novel attacks | Quality metrics |
| Compliance | `ComplianceScanner`, `pii_detection`, `non_advice` | Framework-specific vulnerabilities (HIPAA, GDPR, PCI-DSS, SOX) | Generic quality metrics |
Decision flowchart¶
```mermaid
flowchart TD
    START([What are you evaluating?]) --> KIND{Kind of system?}
    KIND -->|Retrieves and answers| RAG[RAG pipeline]
    KIND -->|Calls tools / takes actions| AGENT[Agent / tool use]
    KIND -->|Holds conversation| CHAT[Chatbot / multi-turn]
    KIND -->|Generates long-form text| CONTENT[Content generation]
    KIND -->|Generates code / SQL / JSON| CODE[Structured output]
    KIND -->|Involves images| MULTI[Multimodal]
    KIND -->|Runs in production| PROD[Runtime guardrail]
    RAG --> RAG_Q{Retrieval or generation problem?}
    RAG_Q -->|Retrieval| RAG_RET[contextual_recall + contextual_precision]
    RAG_Q -->|Generation| RAG_GEN[faithfulness + relevance]
    RAG_Q -->|Both| RAG_FULL[faithfulness + contextual_recall + contextual_precision + relevance]
    AGENT --> AGENT_Q{Expected tool calls known?}
    AGENT_Q -->|Yes, exact list| AGENT_DET[tool_call_f1 - deterministic]
    AGENT_Q -->|Yes, approximate| AGENT_JUDGE[tool_accuracy + argument_correctness]
    AGENT_Q -->|No, open-ended| AGENT_GOAL[task_completion + goal_accuracy]
    CHAT --> CHAT_Q{Single concern?}
    CHAT_Q -->|Task satisfaction| CHAT_COMP[conversation_completeness]
    CHAT_Q -->|Stays on script| CHAT_ROLE[role_adherence + topic_adherence]
    CHAT_Q -->|Remembers context| CHAT_MEM[knowledge_retention]
    CONTENT --> CONTENT_Q{Style or substance?}
    CONTENT_Q -->|Style / format| CONTENT_STYLE[rubric + readability + no_pii]
    CONTENT_Q -->|Custom criteria| CONTENT_GEVAL[g_eval with your rubric]
    CODE --> CODE_Q{Can you run tests?}
    CODE_Q -->|Yes| CODE_TEST[unit tests + is_valid_python]
    CODE_Q -->|No| CODE_JUDGE[code_correctness + rubric]
    MULTI --> MULTI_ALIGN[image_text_alignment + visual_hallucination]
    PROD --> PROD_LAT{Latency budget?}
    PROD_LAT -->|< 50ms| PROD_DET[deterministic only: no_pii + max_tokens]
    PROD_LAT -->|< 500ms| PROD_MIX[+ toxicity + is_refusal]
    PROD_LAT -->|Batch / async| PROD_FULL[full Guard with judge checks]
```
Category deep dives¶
RAG and grounding¶
What it measures. Whether retrieval returned the right context and whether generation actually used it.
When to use. Any application that calls `retrieve()` then `generate()` - knowledge-base bots, doc Q&A, summarisers over fetched text.
Key metrics.
- `faithfulness` - the one non-negotiable metric for RAG. Answers the question "does every claim trace back to context?". Threshold 0.85+.
- `contextual_recall` - retrieval did its job; the relevant chunks are in the context window. Needs a gold answer.
- `contextual_precision` - relevant chunks are ranked ahead of noise. Needs gold relevance labels or a query+answer pair.
- `relevance` - the generated answer addresses the original query (distinct from grounding: an answer can be faithful but off-topic).
- `context_entity_recall` - for entity-heavy questions where missing a single name means the answer is wrong.
- `noise_sensitivity` - deliberately injects irrelevant chunks to see if the model gets distracted.
- `citation_accuracy` - if your UI shows `[1]`, `[2]` markers, verify each marker points at the right chunk.
Common pitfalls.
- Running only `faithfulness`: it cannot flag retrieval failures - an answer faithful to empty or irrelevant context still passes. Pair it with `contextual_recall`.
- Running `relevance` without `faithfulness`: a confident hallucination is on-topic and relevant.
- Using `fluency` to check RAG quality: RAG answers are almost always fluent; the bug is almost always in grounding.
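To make "grounding" concrete: the judge-based `faithfulness` metric decomposes the answer into claims and verifies each against the context. The crude lexical stand-in below (purely illustrative - not checkllm's implementation, and no substitute for the judge) scores the share of answer sentences whose content words appear in the retrieved context:

```python
import re

def lexical_grounding(answer: str, context: str, threshold: float = 0.5) -> float:
    """Share of answer sentences whose content words (>3 chars) mostly
    appear in the context. A rough lexical proxy for groundedness."""
    ctx_words = set(re.findall(r"[a-z']+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    grounded = 0
    for sentence in sentences:
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= threshold:
            grounded += 1
    return grounded / len(sentences) if sentences else 1.0

context = "Paris is the capital of France. It has a population of two million."
answer = "Paris is the capital of France. The Eiffel Tower was built in 1889."
score = lexical_grounding(answer, context)  # second sentence is ungrounded
```

A real faithfulness judge also catches paraphrased and negated claims, which no lexical overlap can.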
Agents and tool use¶
What it measures. Whether an agent picked the right tools, called them with correct arguments, and in a reasonable sequence.
When to use. ReAct agents, MCP clients, function-calling pipelines, LangGraph / CrewAI / custom orchestrators.
Key metrics.
- `tool_call_f1` - deterministic F1 over tool names. Free, instant. Always run it first when you have an expected tool list.
- `tool_accuracy` - judge-based: was the right tool chosen for the query? Use when F1 is too strict (e.g., synonym tools, optional calls).
- `argument_correctness` - semantic check on arguments. Catches `search("capital france")` vs `search("capital of France")`.
- `plan_adherence` - if your agent emits a plan, verify the execution trace follows it.
- `task_completion` / `goal_accuracy` - did the agent finish the job, regardless of how?
- `trajectory_step_count` - catches loops and over-thinking.
Common pitfalls.
- Only checking `task_completion`: a successful task with a bloated 40-step trajectory still passes. Pair with `trajectory_step_count`.
- Ignoring `argument_correctness`: right tool, wrong arguments = silently broken.
- Using judge metrics on trivially checkable tool sequences: start with `tool_call_f1` and only add judges for the fuzzy parts.
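`tool_call_f1` is cheap because multiset F1 over tool names is a few lines of arithmetic. An illustrative re-derivation (not checkllm's actual code):

```python
from collections import Counter

def tool_call_f1(expected: list[str], actual: list[str]) -> float:
    """Multiset F1 over tool names: repeated calls count separately."""
    if not expected and not actual:
        return 1.0
    overlap = sum((Counter(expected) & Counter(actual)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)

# Agent called search twice and never called calculate:
score = tool_call_f1(["search", "calculate"], ["search", "search"])
```

Because it is multiset-based, a looping agent that calls the same tool five times is penalised on precision - the same failure `trajectory_step_count` targets.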
Conversation and multi-turn¶
What it measures. How the assistant performs over a full transcript, not just a single turn.
When to use. Customer-support bots, therapy/coaching flows, any session where state matters across turns.
Key metrics.
- `conversation_completeness` - every user request was ultimately satisfied.
- `role_adherence` - the assistant did not break character / scope.
- `knowledge_retention` - information from turn 2 is still correct at turn 6.
- `topic_adherence` - the conversation stayed within allowed topics (useful for narrow-scope bots like banking assistants).
- Per-turn `turn_relevancy` / `turn_faithfulness` / `turn_coherence` - granular signals when you need to find which turn broke.
Common pitfalls.
- Running per-turn metrics without holistic ones: every turn can pass while the overall conversation fails the user.
- Measuring `correctness` per turn: there is rarely a canonical per-turn ground truth.
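The `knowledge_retention` failure mode is easy to reproduce. checkllm's metric is judge-based and takes a `ConversationalTestCase`; the toy below exists only to make the pattern concrete, and its `key: value` fact format is invented for illustration:

```python
def retention_violations(turns: list[dict[str, str]]) -> list[int]:
    """Toy retention check: once a user states a fact as 'key: value',
    later assistant turns that mention the key must also mention the
    value. Returns the indices of violating turns."""
    facts: dict[str, str] = {}
    bad: list[int] = []
    for i, turn in enumerate(turns):
        text = turn["text"].lower()
        if turn["role"] == "user" and ":" in text:
            key, value = (part.strip() for part in text.split(":", 1))
            facts[key] = value
        elif turn["role"] == "assistant":
            for key, value in facts.items():
                if key in text and value not in text:
                    bad.append(i)
    return bad

turns = [
    {"role": "user", "text": "budget: 500 dollars"},
    {"role": "assistant", "text": "Noted, your budget: 500 dollars."},
    {"role": "assistant", "text": "A budget of 800 works."},  # forgot the constraint
]
violations = retention_violations(turns)
```

A judge-based metric catches the same drift without requiring facts in any fixed format.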
Content generation¶
What it measures. Subjective quality of open-ended text - summaries, marketing copy, explanations.
When to use. No reference output exists, only a rubric or style guide.
Key metrics.
- `rubric` - free-form criteria you write as plain English.
- `g_eval` - G-Eval chain-of-thought variant; more stable scores on fuzzy criteria.
- `instruction_following` - honours explicit instructions ("in 50 words, no lists, formal tone").
- `fluency` / `coherence` - cheap sanity checks.
- `readability` - deterministic grade-level check; great for copy aimed at non-expert readers.
Common pitfalls.
- Running every quality metric at once: `fluency`, `coherence`, `consistency`, and `rubric` all overlap. Pick one per axis.
- Ignoring `instruction_following`: LLMs often drop an instruction silently.
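Deterministic grade-level checks are usually a variant of a classic readability formula. The sketch below uses the Flesch-Kincaid grade level with a naive syllable counter; whether `readability` uses exactly this formula is an assumption:

```python
import re

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    # Naive syllable count: runs of vowels, minimum one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

simple = fk_grade("The cat sat on the mat.")
dense = fk_grade("Comprehensive evaluation necessitates sophisticated instrumentation.")
```

Threshold it like any other deterministic check - for instance, fail consumer-facing copy when the grade exceeds 9.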
Code generation¶
What it measures. Whether generated code parses, runs, and does the right thing.
When to use. SQL generators, Python/TypeScript assistants, code-review tools.
Key metrics.
- Run the code. Nothing else is close. Use `is_valid_python` / `is_valid_sql` as a cheap gate, then actually execute against tests.
- `code_correctness` - judge metric for when tests are impractical (e.g., DSLs, infrastructure code).
- `sql_equivalence` - judge-based logical equivalence for SQL.
- `rubric` - style, readability, idiom checks.
Common pitfalls.
- Using judge metrics instead of execution. A model that says "looks good" is not a compiler.
- Ignoring validity checks: `is_valid_python` catches 40% of bugs for free.
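The validity gate is genuinely free: for Python it amounts to a parse attempt. A sketch of the idea (not checkllm's implementation, which may report line numbers and error details):

```python
import ast

def is_valid_python(source: str) -> bool:
    """Cheap gate: does the code even parse? Real tests still come after."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```

Run it before spending judge tokens: a response that fails to parse needs no semantic review.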
Multimodal¶
What it measures. Text responses grounded in images (or vice versa).
When to use. Vision-language models, OCR pipelines, chart QA, diagram comprehension.
Key metrics.
- `image_text_alignment` - baseline "does the text describe the image".
- `multimodal_faithfulness` / `visual_faithfulness` - claims in the text are supported by the image.
- `visual_hallucination` - catches fabricated visual details.
- `ocr_accuracy` - for pure OCR pipelines.
- `chart_value_extraction` - for quantitative chart QA.
Common pitfalls.
- Using text-only faithfulness on multimodal outputs: it will pass as long as the output is internally consistent.
Production guardrails¶
What it measures. Runtime safety and cost/latency bounds before an output reaches the user.
When to use. Every production LLM endpoint. No exceptions.
Key metrics.
- Always: `no_pii`, `max_tokens`, `is_refusal` (catches jailbreak tells), `toxicity` (judge or a cheap classifier).
- If you have latency budget: add `misuse_detection`, `role_violation`, `non_advice`.
- Structured output: `json_schema` with a Pydantic model.
Common pitfalls.
- Blocking on heavy judge metrics in the hot path: run them async, log failures, and decide asynchronously whether to retract.
- Not pairing regex `no_pii` with judge-based `pii_detection`: regex misses names, addresses, and medical identifiers.
- Raising on every failure: use `soft=True` for observability-only checks.
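The regex half of that pairing looks roughly like this. The patterns below are deliberately minimal and illustrative - not checkllm's pattern set; a production check needs far broader coverage:

```python
import re

# Illustrative patterns only: structured identifiers that regex handles well.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def no_pii(output: str) -> tuple[bool, list[str]]:
    """Returns (passed, kinds of PII found). Structured IDs only -
    pair with a judge metric for names and addresses."""
    found = [kind for kind, pat in PII_PATTERNS.items() if pat.search(output)]
    return (not found, found)

ok, kinds = no_pii("Contact alice@example.com or 555-123-4567.")
```

Exactly because `no_pii` only sees structure, "Alice lives at 12 Elm Street" sails through it - that is the judge metric's job.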
Safety review¶
What it measures. Pre-launch sweep for harmful behaviour across a corpus.
When to use. Before shipping a new model, system prompt, or RAG corpus.
Key metrics. `toxicity`, `bias`, `pii_detection`, `misuse_detection`, optionally `role_violation` and `non_advice` for regulated domains.
Common pitfalls.
- Evaluating on synthetic prompts only. Pair them with a sample of production logs.
- Reducing results to a binary "did any check fail?". Look at severity distributions instead.
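Tallying severities instead of a single pass/fail bit takes one `Counter`. The finding shape below is assumed for illustration, not checkllm's report schema:

```python
from collections import Counter

# Hypothetical findings from a safety sweep, each with a severity label.
findings = [
    {"metric": "toxicity", "severity": "low"},
    {"metric": "pii_detection", "severity": "critical"},
    {"metric": "toxicity", "severity": "low"},
    {"metric": "bias", "severity": "medium"},
]

by_severity = Counter(f["severity"] for f in findings)
```

One critical finding among mostly-low noise reads very differently from a flat "3 of 4 checks failed".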
Red team¶
What it measures. Resistance to adversarial input - prompt injection, jailbreaks, PII exfiltration, role escape.
When to use. At least once per major release; ideally in CI.
Key tools.
- `checkllm.redteam.RedTeamer` - orchestrates attack generation and scoring.
- `VulnerabilityType` - ~150 attack categories mapped to the OWASP LLM Top 10 and compliance frameworks.
- `AdversarialAttackEvolver` - evolves new attacks when known ones stop working.
Common pitfalls.
- Running only `PROMPT_INJECTION`: the interesting failures are usually in `INDIRECT_PROMPT_INJECTION`, `DATA_EXFILTRATION`, and `CONTEXT_POISONING`.
- Reporting vulnerability rates without severity weighting.
Compliance¶
What it measures. Domain-specific regulated behaviour - HIPAA, GDPR, PCI-DSS, SOX, FERPA, and friends.
When to use. Healthcare, finance, ed-tech, any regulated vertical.
Key tools.
- `checkllm.compliance_frameworks.ComplianceScanner` - runs the mapped vulnerability set against a target.
- `RedTeamer.scan_compliance(preset=CompliancePreset.HIPAA)` - tailored scans.
- `non_advice` metric - enforces "not professional advice" disclaimers in regulated verticals.
Common pitfalls.
- Treating compliance output as a pass/fail gate. Most frameworks need human sign-off; checkllm produces the evidence, not the verdict.
Deterministic vs LLM-judge: when each wins¶
| Dimension | Deterministic | LLM judge |
|---|---|---|
| Latency | Microseconds | 0.5-5s per call |
| Cost | Zero | $0.0001 - $0.01 per call |
| Stability | Identical every run | Variance ~5-15% even with temperature=0 |
| Accuracy ceiling | Only catches what you explicitly pattern-match | Can catch semantic issues you did not pre-specify |
| Debuggability | Trivial - the pattern is the spec | Reasoning field helps, but opaque in aggregate |
| Best for | Format, size, presence/absence, budgets | Grounding, intent, style, subjective quality |
| Worst for | Anything requiring judgement | Anything trivially verifiable |
Rule of thumb.
- Write the deterministic checks first. They are free.
- Add one judge metric per remaining concern.
- If a judge metric could be replaced by a deterministic check, replace it. A regex is more reliable than an LLM for "output starts with `{`".
- Cap judge metrics at ~5 per test. Beyond that, latency and variance swamp the signal.
- In production, run deterministic checks synchronously and judge checks asynchronously. Log failures; retract only on critical deterministic failures.
Cost napkin math
1000 evals x 4 judge metrics x $0.002 / call = $8 per run. Trim to 2 judges and you are at $4. Cache identical inputs and you are at pennies.
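The same arithmetic as a reusable helper (hypothetical - not part of checkllm's API):

```python
def judge_cost(n_cases: int, n_judges: int, price_per_call: float,
               cache_hit_rate: float = 0.0) -> float:
    """Per-run judge spend: calls that miss the cache times unit price."""
    return n_cases * n_judges * (1 - cache_hit_rate) * price_per_call

full = judge_cost(1000, 4, 0.002)                        # the $8 run from above
trimmed = judge_cost(1000, 2, 0.002)                     # two judges: $4
cached = judge_cost(1000, 2, 0.002, cache_hit_rate=0.9)  # mostly cached
```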
Migration crib-sheet¶
From DeepEval¶
| DeepEval | checkllm | Import | Notes |
|---|---|---|---|
| `AnswerRelevancyMetric` | `relevance` | `checkllm.metrics.relevance.RelevanceMetric` | Same semantics. |
| `FaithfulnessMetric` | `faithfulness` | `checkllm.metrics.faithfulness.FaithfulnessMetric` | Same semantics. |
| `ContextualPrecisionMetric` | `contextual_precision` | `checkllm.metrics.contextual_precision.ContextualPrecisionMetric` | Same semantics. |
| `ContextualRecallMetric` | `contextual_recall` | `checkllm.metrics.contextual_recall.ContextualRecallMetric` | Same semantics. |
| `ContextualRelevancyMetric` | `context_relevance` | `checkllm.metrics.context_relevance.ContextRelevanceMetric` | Same semantics. |
| `HallucinationMetric` | `hallucination` | `checkllm.metrics.hallucination.HallucinationMetric` | Same semantics; score direction matches (higher = better). |
| `BiasMetric` | `bias` | `checkllm.metrics.bias.BiasMetric` | Same semantics. |
| `ToxicityMetric` | `toxicity` | `checkllm.metrics.toxicity.ToxicityMetric` | Same semantics. |
| `SummarizationMetric` | `summarization` | `checkllm.metrics.summarization.SummarizationMetric` | Same semantics. |
| `GEval` | `g_eval` | `checkllm.metrics.g_eval.GEvalMetric` | Same chain-of-thought formulation. |
| `ToolCorrectnessMetric` | `tool_accuracy` or `tool_call_f1` | `checkllm.metrics.tool_accuracy.ToolAccuracyMetric` | `tool_call_f1` is deterministic and free. |
| `TaskCompletionMetric` | `task_completion` | `checkllm.metrics.task_completion.TaskCompletionMetric` | Same semantics. |
| `RoleAdherenceMetric` | `role_adherence` | `checkllm.metrics.role_adherence.RoleAdherenceMetric` | Same semantics. |
| `KnowledgeRetentionMetric` | `knowledge_retention` | `checkllm.metrics.knowledge_retention.KnowledgeRetentionMetric` | Takes a `ConversationalTestCase`. |
| `ConversationCompletenessMetric` | `conversation_completeness` | `checkllm.metrics.conversation_completeness.ConversationCompletenessMetric` | Takes a `ConversationalTestCase`. |
| `PIILeakageMetric` | `pii_detection` (judge) + `no_pii` (regex) | `checkllm.metrics.pii_detection.PIIDetectionMetric` | Run both; regex catches structured IDs, judge catches names. |
From Ragas¶
| Ragas | checkllm | Import | Notes |
|---|---|---|---|
| `faithfulness` | `faithfulness` | `checkllm.metrics.faithfulness.FaithfulnessMetric` | Same. |
| `answer_relevancy` | `relevance` | `checkllm.metrics.relevance.RelevanceMetric` | Equivalent; checkllm does not sample synthetic questions. |
| `context_precision` | `contextual_precision` | `checkllm.metrics.contextual_precision.ContextualPrecisionMetric` | Same. |
| `context_recall` | `contextual_recall` | `checkllm.metrics.contextual_recall.ContextualRecallMetric` | Same. |
| `context_entity_recall` | `context_entity_recall` | `checkllm.metrics.context_entity_recall.ContextEntityRecallMetric` | Same. |
| `noise_sensitivity` | `noise_sensitivity` | `checkllm.metrics.noise_sensitivity.NoiseSensitivityMetric` | Same. |
| `answer_correctness` | `correctness` + `factual_correctness` | `checkllm.metrics.correctness.CorrectnessMetric` | Ragas fuses factual + semantic; checkllm splits them. |
| `answer_similarity` | `semantic_similarity` (deterministic) | `DeterministicChecks.semantic_similarity` | Embedding cosine similarity; no judge required. |
| `aspect_critic` / custom prompts | `rubric` or `g_eval` | `checkllm.metrics.rubric.RubricMetric` | Write the critique as criteria text. |
| `factual_correctness` | `factual_correctness` | `checkllm.metrics.factual_correctness.FactualCorrectnessMetric` | Same formulation (claim-level P/R/F1). |
| `summarization_score` | `summarization` | `checkllm.metrics.summarization.SummarizationMetric` | Same. |
| `tool_call_accuracy` | `tool_call_f1` or `tool_accuracy` | `checkllm.metrics.tool_call_f1.ToolCallF1Metric` | `tool_call_f1` is deterministic. |
| `agent_goal_accuracy` | `goal_accuracy` | `checkllm.metrics.goal_accuracy.GoalAccuracyMetric` | Same. |
| `topic_adherence` | `topic_adherence` | `checkllm.metrics.topic_adherence.TopicAdherenceMetric` | Same. |
From promptfoo¶
| promptfoo assertion | checkllm | Import | Notes |
|---|---|---|---|
| `contains` | `contains` | `DeterministicChecks.contains` | Same. |
| `icontains` | `icontains` | `DeterministicChecks.icontains` | Same. |
| `regex` | `regex` | `DeterministicChecks.regex` | Same. |
| `starts-with` / `ends-with` | `starts_with` / `ends_with` | `DeterministicChecks.starts_with` | Same. |
| `is-json` | `is_json` | `DeterministicChecks.is_json` | Same. |
| `json-schema` | `json_schema` | `DeterministicChecks.json_schema` | checkllm takes a Pydantic model. |
| `cost` / `latency` | `cost` / `latency` | `DeterministicChecks.cost` / `.latency` | Same. |
| `similar` (embedding) | `semantic_similarity` | `DeterministicChecks.semantic_similarity` | Same formulation. |
| `llm-rubric` | `rubric` | `checkllm.metrics.rubric.RubricMetric` | Same. |
| `model-graded-factuality` | `factual_correctness` | `checkllm.metrics.factual_correctness.FactualCorrectnessMetric` | Same. |
| `answer-relevance` | `relevance` | `checkllm.metrics.relevance.RelevanceMetric` | Same. |
| `moderation` / `toxicity` | `toxicity` | `checkllm.metrics.toxicity.ToxicityMetric` | Same. checkllm also exposes `is_refusal` as a cheap pre-filter. |
| `bleu` / `rouge` / `meteor` | `bleu` / `rouge_l` / `meteor` | `DeterministicChecks.bleu` / `.rouge_l` / `.meteor` | Same. |
| `perplexity` | `perplexity_check` | `DeterministicChecks.perplexity_check` | Same. |
| Red-team plugins (jailbreak, ...) | `RedTeamer` with `VulnerabilityType` | `checkllm.redteam.RedTeamer` | checkllm ships ~150 vulnerability types vs ~20 in promptfoo. |
Next steps¶
- Metrics reference - full alphabetical catalog.
- `examples/test_rag_pipeline.py` - RAG metrics composed end-to-end.
- `examples/test_agentic_evaluation.py` - agent and tool-use metrics.
- `examples/test_multi_turn_conversation.py` - `ConversationalTestCase` patterns.
- `examples/test_guardrails_production.py` - `Guard` in production.
- `examples/test_red_team_evaluation.py` - `RedTeamer` usage.