Metrics reference
A categorized catalog of every built-in metric and deterministic check in checkllm.
Use this page as an index; for guidance on which metric to pick in a given
scenario, see Choosing metrics.
Import shortcuts
Every judge metric is also exposed as a method on the check pytest fixture
(e.g. check.faithfulness(...)). The Import column shows the direct class
import used when you instantiate a metric yourself.
Deterministic checks live on the DeterministicChecks class
(from checkllm.deterministic import DeterministicChecks) and as methods on
the check fixture (e.g. check.contains(...)).
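A quick illustration of both styles. The keyword arguments, thresholds, and the answer/chunks variables below are illustrative, not exact signatures; see each metric's page for the real parameters.

```python
# Via the check pytest fixture (argument names are illustrative):
def test_answer_is_grounded(check):
    check.faithfulness(output=answer, context=chunks, threshold=0.85)
    check.contains(answer, "Paris")

# Via direct class imports:
from checkllm.metrics.faithfulness import FaithfulnessMetric
from checkllm.deterministic import DeterministicChecks

metric = FaithfulnessMetric(threshold=0.85)
```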
RAG and grounding
Evaluates retrieval quality and whether generation is grounded in retrieved context.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| faithfulness | checkllm.metrics.faithfulness.FaithfulnessMetric | LLM judge | Fraction of claims in the answer that are supported by context | 0.85 |
| faithfulness_hhem | checkllm.metrics.faithfulness_hhem.FaithfulnessHHEMMetric | Model (HHEM) | HHEM-2.1 hallucination score; local, fast, cheap | 0.80 |
| hallucination | checkllm.metrics.hallucination.HallucinationMetric | LLM judge | Likelihood of unsupported claims (raw score is inverted, so higher is better) | 0.85 |
| contextual_precision | checkllm.metrics.contextual_precision.ContextualPrecisionMetric | LLM judge | Whether relevant chunks are ranked ahead of irrelevant ones | 0.75 |
| contextual_recall | checkllm.metrics.contextual_recall.ContextualRecallMetric | LLM judge | Fraction of the expected answer found in retrieved context | 0.80 |
| context_relevance | checkllm.metrics.context_relevance.ContextRelevanceMetric | LLM judge | Relevance of retrieved context to the query | 0.75 |
| context_entity_recall | checkllm.metrics.context_entity_recall.ContextEntityRecallMetric | LLM judge | Fraction of ground-truth entities present in context | 0.80 |
| nonllm_context_precision | checkllm.metrics.nonllm_context_precision.NonLLMContextPrecisionMetric | Deterministic | Context precision via string/semantic match (no judge) | 0.70 |
| nonllm_context_recall | checkllm.metrics.nonllm_context_recall.NonLLMContextRecallMetric | Deterministic | Context recall via string/semantic match (no judge) | 0.70 |
| groundedness | checkllm.metrics.groundedness.GroundednessMetric | LLM judge | Whether every factual claim traces back to context | 0.85 |
| noise_sensitivity | checkllm.metrics.noise_sensitivity.NoiseSensitivityMetric | LLM judge | How much irrelevant context degrades the answer | 0.75 |
| citation_accuracy | checkllm.metrics.citation_accuracy.CitationAccuracyMetric | LLM judge | Whether inline citations point to the correct chunk | 0.85 |
| quoted_spans | checkllm.metrics.quoted_spans.QuotedSpansAlignmentMetric | Deterministic | Verbatim-quoted spans actually appear in context | 0.95 |
| nv_answer_accuracy | checkllm.metrics.dual_judge_nv.NVAnswerAccuracyMetric | Dual judge | NVIDIA dual-judge answer accuracy (two models vote) | 0.80 |
| nv_context_relevance | checkllm.metrics.dual_judge_nv.NVContextRelevanceMetric | Dual judge | NVIDIA dual-judge context relevance | 0.75 |
| nv_response_groundedness | checkllm.metrics.dual_judge_nv.NVResponseGroundednessMetric | Dual judge | NVIDIA dual-judge response groundedness | 0.80 |
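Most metrics in this table call an LLM judge, but quoted_spans is deterministic. A minimal sketch of the idea (an assumption about the approach, not checkllm's actual implementation): extract every double-quoted span from the answer and check that each appears verbatim in the retrieved context.

```python
import re


def quoted_spans_alignment(output: str, context: str) -> float:
    """Return the fraction of double-quoted spans in `output` found verbatim in `context`."""
    spans = re.findall(r'"([^"]+)"', output)
    if not spans:
        return 1.0  # nothing was quoted, so there is nothing to verify
    hits = sum(1 for span in spans if span in context)
    return hits / len(spans)


context = 'The report states "revenue grew 12%" in Q3.'
answer = 'According to the report, "revenue grew 12%" last quarter.'
print(quoted_spans_alignment(answer, context))  # 1.0
```

A score of 1.0 means every quote was found; the typical 0.95 threshold tolerates an occasional near-miss in longer answers.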
Answer quality
Generic output-quality signals that apply beyond RAG.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| relevance | checkllm.metrics.relevance.RelevanceMetric | LLM judge | Whether the answer addresses the query | 0.80 |
| correctness | checkllm.metrics.correctness.CorrectnessMetric | LLM judge | Whether the answer matches the expected output | 0.80 |
| factual_correctness | checkllm.metrics.factual_correctness.FactualCorrectnessMetric | LLM judge | Precision/recall/F1 over atomic factual claims | 0.80 |
| answer_completeness | checkllm.metrics.answer_completeness.AnswerCompletenessMetric | LLM judge | Whether every sub-question in the query is answered | 0.80 |
| response_completeness | checkllm.metrics.response_completeness.ResponseCompletenessMetric | LLM judge | Whether the response covers all required aspects | 0.80 |
| fluency | checkllm.metrics.fluency.FluencyMetric | LLM judge | Grammatical and stylistic fluency | 0.80 |
| coherence | checkllm.metrics.coherence.CoherenceMetric | LLM judge | Logical flow and structure | 0.80 |
| consistency | checkllm.metrics.consistency.ConsistencyMetric | LLM judge | Internal self-consistency across the output | 0.80 |
| summarization | checkllm.metrics.summarization.SummarizationMetric | LLM judge | Summary preserves key information without distortion | 0.80 |
| instruction_following | checkllm.metrics.instruction_following.InstructionFollowingMetric | LLM judge | Whether explicit instructions were followed | 0.85 |
| instruction_completeness | checkllm.metrics.instruction_completeness.InstructionCompletenessMetric | LLM judge | Fraction of instructions executed | 0.85 |
| prompt_alignment | checkllm.metrics.prompt_alignment.PromptAlignmentMetric | LLM judge | Response stays aligned with prompt intent and constraints | 0.85 |
| rubric | checkllm.metrics.rubric.RubricMetric | LLM judge | Free-form rubric criteria you supply | 0.80 |
| g_eval | checkllm.metrics.g_eval.GEvalMetric | LLM judge | G-Eval chain-of-thought rubric scoring | 0.80 |
| comparative_quality | checkllm.metrics.comparative_quality.ComparativeQualityMetric | LLM judge | Pairwise comparison against a reference output | 0.50 |
| sentiment | checkllm.metrics.sentiment.SentimentMetric | LLM judge | Polarity; pair with an expected direction | 0.50 |
Agents and tools
Evaluates agent trajectories, tool calls, and planning.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| tool_call_f1 | checkllm.metrics.tool_call_f1.ToolCallF1Metric | Deterministic | F1 of predicted vs expected tool names | 0.80 |
| tool_accuracy | checkllm.metrics.tool_accuracy.ToolAccuracyMetric | LLM judge | Whether the right tools were invoked for the query | 0.80 |
| argument_correctness | checkllm.metrics.argument_correctness.ArgumentCorrectnessMetric | LLM judge | Whether tool arguments are semantically correct | 0.80 |
| task_completion | checkllm.metrics.task_completion.TaskCompletionMetric | LLM judge | Whether the agent finished the requested task | 0.80 |
| goal_accuracy | checkllm.metrics.goal_accuracy.GoalAccuracyMetric | LLM judge | Final output achieves the stated goal | 0.80 |
| plan_quality | checkllm.metrics.plan_quality.PlanQualityMetric | LLM judge | The agent's plan is sensible and minimal | 0.75 |
| plan_adherence | checkllm.metrics.plan_adherence.PlanAdherenceMetric | LLM judge | Execution trace follows the declared plan | 0.80 |
| step_efficiency | checkllm.metrics.step_efficiency.StepEfficiencyMetric | LLM judge | Fraction of steps that were actually necessary | 0.70 |
| trajectory_goal_success | checkllm.metrics.trajectory.TrajectoryGoalSuccessMetric | LLM judge | Whole trajectory reaches the goal | 0.80 |
| trajectory_tool_sequence | checkllm.metrics.trajectory.TrajectoryToolSequenceMetric | Hybrid | Tools invoked in the expected order | 0.80 |
| trajectory_step_count | checkllm.metrics.trajectory.TrajectoryStepCountMetric | Hybrid | Task completed within a step budget | 0.75 |
| trajectory_tool_args_match | checkllm.metrics.trajectory.TrajectoryToolArgsMatchMetric | LLM judge | Tool arguments match expected values (fuzzy) | 0.80 |
| mcp_use | checkllm.metrics.mcp_use.McpUseMetric | LLM judge | MCP tool usage is appropriate | 0.80 |
| mcp_task_completion | checkllm.metrics.mcp_task_completion.McpTaskCompletionMetric | LLM judge | MCP session completed the task | 0.80 |
| multi_turn_mcp_use | checkllm.metrics.multi_turn_mcp_use.MultiTurnMcpUseMetric | LLM judge | Multi-turn MCP interaction stayed on task | 0.80 |
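tool_call_f1 is the one deterministic metric in this group. A plausible sketch of the scoring, assuming multiset F1 over tool names (checkllm's exact matching rules may differ):

```python
from collections import Counter


def tool_call_f1(predicted: list[str], expected: list[str]) -> float:
    """F1 over tool-name multisets: rewards calling the right tools, penalises extras."""
    if not predicted and not expected:
        return 1.0  # nothing expected, nothing called
    overlap = sum((Counter(predicted) & Counter(expected)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


# One extra "fetch" call drops precision but not recall:
print(tool_call_f1(["search", "fetch", "fetch"], ["search", "fetch"]))  # 0.8
```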
Conversation and multi-turn
Evaluates chatbots and multi-turn sessions using ConversationalTestCase.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| conversation_completeness | checkllm.metrics.conversation_completeness.ConversationCompletenessMetric | LLM judge | Every user request in the transcript was satisfied | 0.80 |
| role_adherence | checkllm.metrics.role_adherence.RoleAdherenceMetric | LLM judge | Assistant stayed in its declared role | 0.85 |
| knowledge_retention | checkllm.metrics.knowledge_retention.KnowledgeRetentionMetric | LLM judge | Assistant remembers information from earlier turns | 0.80 |
| topic_adherence | checkllm.metrics.topic_adherence.TopicAdherenceMetric | LLM judge | Conversation stays within allowed topics | 0.85 |
| turn_relevancy | checkllm.metrics.per_turn.TurnRelevancyMetric | LLM judge | Per-turn response relevance | 0.80 |
| turn_faithfulness | checkllm.metrics.per_turn.TurnFaithfulnessMetric | LLM judge | Per-turn faithfulness to provided context | 0.85 |
| turn_coherence | checkllm.metrics.per_turn.TurnCoherenceMetric | LLM judge | Per-turn coherence with surrounding turns | 0.80 |
Safety and red team
Runtime safety signals. Use these as guardrails and inside red-team scans.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| toxicity | checkllm.metrics.toxicity.ToxicityMetric | LLM judge | Toxic/abusive content probability | 0.80 |
| bias | checkllm.metrics.bias.BiasMetric | LLM judge | Demographic, political, or ideological bias | 0.80 |
| pii_detection | checkllm.metrics.pii_detection.PIIDetectionMetric | LLM judge | Free-text PII that regex misses (names, addresses) | 0.90 |
| misuse_detection | checkllm.metrics.misuse_detection.MisuseDetectionMetric | LLM judge | Output assists harmful or disallowed use | 0.90 |
| role_violation | checkllm.metrics.role_violation.RoleViolationMetric | LLM judge | Response breaks configured role constraints | 0.85 |
| non_advice | checkllm.metrics.non_advice.NonAdviceMetric | LLM judge | Response avoids regulated professional advice | 0.85 |
| no_pii (deterministic) | checkllm.deterministic.DeterministicChecks.no_pii | Deterministic | SSN/email/credit-card regex screen | n/a (pass/fail) |
| is_refusal | checkllm.deterministic.DeterministicChecks.is_refusal | Deterministic | Text is a safety refusal ("I can't help with that") | n/a (pass/fail) |
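The is_refusal check can be pictured as a phrase-list match. The marker list below is illustrative, not checkllm's actual list:

```python
# Illustrative refusal markers; the real check's phrase list is not shown here.
REFUSAL_MARKERS = (
    "i can't help with that",
    "i cannot help with that",
    "i'm unable to assist",
    "i won't be able to help",
)


def is_refusal(output: str) -> bool:
    """True if the text looks like a safety refusal rather than an answer."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


print(is_refusal("I can't help with that request."))  # True
print(is_refusal("Here is the recipe you asked for."))  # False
```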
Code and structured output
Evaluates code, SQL, JSON, and other structured responses.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| code_correctness | checkllm.metrics.code_correctness.CodeCorrectnessMetric | LLM judge | Code meets functional and quality criteria | 0.80 |
| sql_equivalence | checkllm.metrics.sql_equivalence.SQLEquivalenceMetric | LLM judge | Two SQL queries are logically equivalent | 0.85 |
| datacompy_score | checkllm.metrics.datacompy_score.DataCompyMetric | Deterministic | Pandas dataframe equality via datacompy | 0.95 |
| dag | checkllm.metrics.dag.DAGMetric | Composite | Node-graph of metrics with gating edges | custom |
Multimodal and vision
Evaluates responses involving images or other modalities.
| Metric | Import | Type | What it measures | Typical threshold |
| --- | --- | --- | --- | --- |
| image_text_alignment | checkllm.metrics.image_text_alignment.ImageTextAlignmentMetric | LLM judge | Text matches the provided image | 0.80 |
| image_captioning_quality | checkllm.metrics.image_captioning_quality.ImageCaptioningQualityMetric | LLM judge | Caption quality and completeness | 0.80 |
| image_coherence | checkllm.metrics.image_coherence.ImageCoherenceMetric | LLM judge | Output coherent with the image context | 0.80 |
| image_consistency | checkllm.metrics.image_consistency.ImageConsistencyMetric | LLM judge | Multiple images remain consistent | 0.80 |
| image_editing | checkllm.metrics.image_editing.ImageEditingMetric | LLM judge | Edit matches the requested transformation | 0.80 |
| image_helpfulness | checkllm.metrics.image_helpfulness.ImageHelpfulnessMetric | LLM judge | Image actually helps answer the query | 0.80 |
| image_reference | checkllm.metrics.image_reference.ImageReferenceMetric | LLM judge | Text refers to correct image regions | 0.80 |
| image_relevance | checkllm.metrics.image_relevance.ImageRelevanceMetric | LLM judge | Image is relevant to the text query | 0.75 |
| image_safety | checkllm.metrics.image_safety.ImageSafetyMetric | LLM judge | Image content is safe | 0.90 |
| text_to_image | checkllm.metrics.text_to_image.TextToImageMetric | LLM judge | Generated image matches the text prompt | 0.80 |
| multimodal_faithfulness | checkllm.metrics.multimodal_faithfulness.MultimodalFaithfulnessMetric | LLM judge | Answer grounded in both image and text context | 0.85 |
| visual_faithfulness | checkllm.metrics.visual_faithfulness.VisualFaithfulnessMetric | LLM judge | Visual claims are supported by the image | 0.85 |
| visual_hallucination | checkllm.metrics.visual_hallucination.VisualHallucinationMetric | LLM judge | Output hallucinates visual details | 0.85 |
| visual_reasoning | checkllm.metrics.visual_reasoning.VisualReasoningMetric | LLM judge | Correct reasoning over visual content | 0.80 |
| ocr_accuracy | checkllm.metrics.ocr_accuracy.OCRAccuracyMetric | LLM judge | OCR output matches image text | 0.90 |
| chart_value_extraction | checkllm.metrics.chart_value_extraction.ChartValueExtractionMetric | LLM judge | Numeric values extracted from a chart are correct | 0.90 |
| diagram_comprehension | checkllm.metrics.diagram_comprehension.DiagramComprehensionMetric | LLM judge | Understanding of a technical diagram | 0.80 |
Deterministic checks (structural and lexical)
Deterministic checks are free, instant, and stable. Reach for these first and
only add judge metrics where judgement is required.
Content presence
| Check | Signature | What it checks |
| --- | --- | --- |
| contains | contains(output, substring) | Substring is present |
| not_contains | not_contains(output, substring) | Substring is absent |
| icontains | icontains(output, substring) | Case-insensitive contains |
| icontains_any / icontains_all | (output, substrings) | Any/all substrings present (case-insensitive) |
| all_of / any_of / none_of | (output, substrings) | Set membership over required substrings |
| starts_with / ends_with | (output, prefix/suffix) | Prefix/suffix check |
| exact_match | exact_match(output, expected, ignore_case=False) | Full equality |
| exact_match_strict | (output, reference, ignore_case, ignore_whitespace) | Full equality with whitespace options |
| regex | regex(output, pattern) | Regex match |
| has_structure | has_structure(output, elements) | Required structural elements present |
| has_citations | has_citations(output, min_count=1) | At least N citation markers |
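The semantics of the contains family are straightforward string operations. A self-contained sketch of the documented behaviour (standalone functions here, whereas checkllm exposes them as methods):

```python
def contains(output: str, substring: str) -> bool:
    """Case-sensitive substring presence."""
    return substring in output


def icontains_any(output: str, substrings: list[str]) -> bool:
    """True if at least one substring appears, ignoring case."""
    lowered = output.lower()
    return any(s.lower() in lowered for s in substrings)


def icontains_all(output: str, substrings: list[str]) -> bool:
    """True only if every substring appears, ignoring case."""
    lowered = output.lower()
    return all(s.lower() in lowered for s in substrings)


out = "Paris is the capital of France."
print(contains(out, "Paris"))                    # True
print(icontains_any(out, ["paris", "rome"]))     # True
print(icontains_all(out, ["paris", "france"]))   # True
```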
Size and shape
| Check | Signature | What it checks |
| --- | --- | --- |
| max_tokens / min_tokens | (output, limit / minimum) | Token budget bounds |
| word_count | word_count(output, min_words=None, max_words=None) | Word-count bounds |
| char_count | char_count(output, min_chars=None, max_chars=None) | Character-count bounds |
| sentence_count | sentence_count(output, min_sentences=None, max_sentences=None) | Sentence-count bounds |
| readability | readability(output, max_grade=None, min_grade=None) | Flesch-Kincaid grade-level bounds |
| no_repetition | no_repetition(output, max_ngram_repeat=3) | No N-gram repeated beyond a limit |
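A sketch of what no_repetition might do, assuming word trigrams (the n-gram size checkllm actually counts is not documented here):

```python
def no_repetition(output: str, max_ngram_repeat: int = 3, n: int = 3) -> bool:
    """True if no word n-gram occurs more than max_ngram_repeat times."""
    words = output.lower().split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return all(c <= max_ngram_repeat for c in counts.values())


# The trigram "the model said" appears four times, exceeding the default limit of 3:
print(no_repetition("the model said the model said the model said the model said"))  # False
```

Degenerate looping like the failing example above is a common failure mode of smaller models, which is why this check is cheap insurance on generation pipelines.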
Format validation
| Check | Signature | What it checks |
| --- | --- | --- |
| is_json | is_json(output) | Parses as JSON |
| json_schema | json_schema(output, schema) | Conforms to a Pydantic schema |
| json_field | json_field(output, field_path, expected=None, condition=None) | JSON field equals/satisfies condition |
| is_valid_python | is_valid_python(output) | Parses as Python |
| is_valid_sql | is_valid_sql(output) | Parses as SQL |
| is_valid_yaml / is_yaml | (output) | Parses as YAML |
| is_valid_markdown | is_valid_markdown(output, require_headers=False, require_lists=False, require_code_blocks=False) | Markdown structural checks |
| is_html / contains_html | (output) | HTML structure / fragment |
| is_xml / contains_xml | (output) | XML structure / fragment |
| is_url / is_valid_url / has_url | (output) | URL validation |
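The JSON checks reduce to standard-library parsing plus a path walk. A sketch of the documented is_json and json_field signatures (dotted-path traversal is an assumption about how field_path is interpreted):

```python
import json


def is_json(output: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False


def json_field(output: str, field_path: str, expected=None) -> bool:
    """Walk a dotted path like 'user.name' and optionally compare against `expected`."""
    try:
        value = json.loads(output)
    except ValueError:
        return False
    for key in field_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return expected is None or value == expected


out = '{"user": {"name": "Ada", "age": 36}}'
print(is_json(out))                         # True
print(json_field(out, "user.name", "Ada"))  # True
```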
Similarity metrics
| Check | Signature | What it checks |
| --- | --- | --- |
| similarity | similarity(output, expected, threshold=0.8, ignore_case=False) | Token-based Jaccard similarity |
| levenshtein | levenshtein(output, reference, threshold=0.7) | Normalised edit distance |
| string_distance | string_distance(output, reference, method="levenshtein", threshold=0.7) | Pluggable string distance |
| semantic_similarity | semantic_similarity(output, reference, threshold=0.7) | Embedding cosine similarity |
| bleu | bleu(output, reference, threshold=0.5) | BLEU n-gram overlap |
| rouge_l | rouge_l(output, reference, threshold=0.5) | ROUGE-L longest-common-subsequence |
| meteor | meteor(output, reference, threshold=0.5) | METEOR with stemming/synonyms |
| gleu | gleu(output, reference, threshold=0.5) | GLEU (Google BLEU) |
| chrf | chrf(output, reference, threshold=0.5) | ChrF character-n-gram F-score |
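The first two rows have simple textbook definitions. A sketch of token-based Jaccard similarity and a normalised Levenshtein ratio, assuming whitespace tokenisation and max-length normalisation (checkllm's exact tokenisation and normalisation may differ):

```python
def jaccard_similarity(output: str, expected: str, ignore_case: bool = False) -> float:
    """Overlap of word sets: |A ∩ B| / |A ∪ B|."""
    if ignore_case:
        output, expected = output.lower(), expected.lower()
    a, b = set(output.split()), set(expected.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein_ratio(output: str, reference: str) -> float:
    """1 - edit_distance / max_len, so 1.0 means the strings are identical."""
    m, n = len(output), len(reference)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))  # dynamic-programming row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == reference[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 1 - prev[n] / max(m, n)


print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```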
Budgets and numeric
| Check | Signature | What it checks |
| --- | --- | --- |
| max_tokens / min_tokens | (output, limit / minimum) | Token budget |
| latency | latency(actual_ms, max_ms) | Elapsed latency under bound |
| latency_check | latency_check(start_time, end_time, max_ms=5000.0) | Latency from wall-clock times |
| cost | cost(actual_usd, max_usd) | USD cost under bound |
| cost_check | cost_check(input_tokens, output_tokens, model, max_cost=1.0) | Estimated cost from token counts |
| greater_than / less_than / between | (output, threshold [, high]) | Numeric bounds on parsed output |
| perplexity_check | perplexity_check(output, max_perplexity=50.0) | Perplexity upper bound |
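cost_check estimates spend from token counts. A sketch of the arithmetic; the price table and model name below are placeholders, not real provider prices, and checkllm's own pricing data will differ:

```python
# Hypothetical USD prices per million tokens: (input, output).
PRICES_USD_PER_M = {"example-model": (3.00, 15.00)}


def cost_check(input_tokens: int, output_tokens: int, model: str, max_cost: float = 1.0) -> bool:
    """True if the estimated request cost stays within max_cost USD."""
    in_price, out_price = PRICES_USD_PER_M[model]
    cost = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return cost <= max_cost


# 200k input tokens at $3/M plus 10k output tokens at $15/M = $0.75, under the $1 budget:
print(cost_check(200_000, 10_000, "example-model", max_cost=1.0))  # True
```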
Safety primitives
| Check | Signature | What it checks |
| --- | --- | --- |
| no_pii | no_pii(output, patterns=None) | No SSN/email/credit-card patterns |
| is_refusal | is_refusal(output) | Output is a safety refusal phrase |
| language | language(output, expected) | Detected language matches |
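A sketch of the no_pii regex screen covering the three documented categories. These patterns are illustrative defaults, not checkllm's actual ones, and like any regex screen they will miss free-text PII (that is what the pii_detection judge metric is for):

```python
import re

# Illustrative patterns; checkllm's built-in defaults may be stricter.
DEFAULT_PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",           # US SSN
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",     # email address
    r"\b(?:\d[ -]?){13,16}\b",          # credit-card-like digit run
]


def no_pii(output: str, patterns=None) -> bool:
    """True if none of the PII patterns match the text."""
    for pattern in patterns or DEFAULT_PII_PATTERNS:
        if re.search(pattern, output):
            return False
    return True


print(no_pii("Contact support for help."))   # True
print(no_pii("My SSN is 123-45-6789."))      # False
```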