# Model-graded declarative assertions
CheckLLM's YAML evals support a promptfoo-compatible assertion
vocabulary. Mix deterministic checks (`contains`, `regex`, …) with
model-graded checks (`llm-rubric`, `model-graded-relevance`, …) without
writing any Python.
## Why promptfoo syntax?
Most teams evaluating LLMs already know promptfoo's YAML shape. Reusing
it — with the same assert block layout and the same assertion names —
removes a migration barrier. Under the hood each model-graded type maps
onto an existing CheckLLM metric, so you get the full strictness of our
judges plus the ergonomics of promptfoo.
## Full example

```yaml
tests:
  - prompt: "Summarize: {{text}}"
    vars:
      text: "Photosynthesis converts sunlight into chemical energy."
    assert:
      - type: contains
        value: "energy"
      - type: not-contains
        value: "TODO"
      - type: regex
        value: "\\b[Pp]hoto"
      - type: equals
        value: "Photosynthesis converts sunlight into chemical energy."
      - type: model
        prompt: "Is this response professional? Answer with a score 0-1."
        threshold: 0.7
      - type: llm-rubric
        rubric: "Response must be concise, professional, and factually correct."
        threshold: 0.75
      - type: model-graded-relevance
        query: "{{text}}"
        threshold: 0.8
      - type: model-graded-faithfulness
        context: "{{text}}"
        threshold: 0.85
      - type: similarity
        reference: "Photosynthesis turns sunlight into energy."
        threshold: 0.8
      - type: cost
        value: 0.02   # fail if judge cost exceeds $0.02
      - type: latency
        value: 1500   # fail if measured latency > 1500ms
```
## Assertion type reference

| Type | Description | Required fields | Default threshold |
|---|---|---|---|
| `contains` | Output contains substring | `value` | — |
| `not-contains` | Output lacks substring | `value` | — |
| `regex` | Output matches pattern | `value` | — |
| `equals` | Output equals value (trimmed) | `value` | — |
| `similarity` | Levenshtein ratio vs reference | `reference` or `value` | 0.8 |
| `model` | Free-form judge prompt | `prompt` | 0.8 |
| `llm-rubric` | Graded against natural-language rubric | `rubric` | 0.8 |
| `model-graded-relevance` | Checks answer relevance to query | `query` | 0.8 |
| `model-graded-faithfulness` | Checks answer faithfulness to context | `context` | 0.8 |
| `cost` | Last judge call cost <= value (USD) | `value` | — |
| `latency` | Observed `latency_ms` <= value | `value` | — |
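The `similarity` check runs locally, with no judge call. As a rough sketch of what a Levenshtein ratio computes (CheckLLM's exact formula may differ; `levenshtein` and `similarity_ratio` below are illustrative names, not CheckLLM APIs):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def similarity_ratio(output: str, reference: str) -> float:
    # 1.0 for identical strings, approaching 0.0 as they diverge.
    if not output and not reference:
        return 1.0
    dist = levenshtein(output, reference)
    return 1.0 - dist / max(len(output), len(reference))


score = similarity_ratio(
    "Photosynthesis turns sunlight into energy.",
    "Photosynthesis converts sunlight into chemical energy.",
)
print(round(score, 2))
```

A score at or above the assertion's `threshold` (default 0.8) passes; because the comparison is local string math, it costs nothing per run.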
Unknown types raise `ValueError` from `parse_assertions`, so typos fail
fast at parse time instead of silently passing.
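That fail-fast behaviour can be pictured as a simple membership check against the known types. This is an illustrative sketch, not CheckLLM's actual source; `KNOWN_TYPES` and `check_types` are hypothetical names:

```python
KNOWN_TYPES = {
    "contains", "not-contains", "regex", "equals", "similarity",
    "model", "llm-rubric", "model-graded-relevance",
    "model-graded-faithfulness", "cost", "latency",
}


def check_types(raw_assertions: list[dict]) -> None:
    # Mirrors parse-time validation: a typo like "llm-rubrik"
    # raises immediately instead of silently passing at eval time.
    for index, entry in enumerate(raw_assertions):
        if entry.get("type") not in KNOWN_TYPES:
            raise ValueError(
                f"Unknown assertion type at index {index}: {entry.get('type')!r}"
            )


check_types([{"type": "contains", "value": "energy"}])  # valid: no error
```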
## Template variables

Fields that accept natural-language input (`prompt`, `rubric`, `query`,
`context`) support the same `{{var}}` substitution as YAML prompts.
Variables are resolved from the optional `context` dict passed to
`evaluate_assertions`, which typically contains the test's `vars`
plus observability fields like `latency_ms`.
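A minimal sketch of that substitution, assuming plain `{{var}}` placeholders with optional surrounding whitespace and no filters (`render` is a hypothetical helper, not a CheckLLM API):

```python
import re


def render(template: str, context: dict) -> str:
    # Replace each {{var}} with the matching context value;
    # unknown variables are left untouched rather than erased.
    def sub(match: re.Match) -> str:
        name = match.group(1)
        return str(context[name]) if name in context else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)


print(render("Is the answer relevant to {{query}}?", {"query": "photosynthesis"}))
# → Is the answer relevant to photosynthesis?
```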
## Running programmatically

`evaluate_assertions` is a coroutine, so call it from an async context:

```python
import asyncio

from checkllm.providers import create_judge
from checkllm.yaml_assertions import evaluate_assertions, parse_assertions

raw = [
    {"type": "contains", "value": "Paris"},
    {"type": "llm-rubric", "rubric": "Must be factual.", "threshold": 0.7},
]
assertions = parse_assertions(raw)
judge = create_judge("openai", model="gpt-4o-mini")


async def main() -> None:
    results = await evaluate_assertions(
        output="The capital of France is Paris.",
        assertions=assertions,
        judge=judge,
        context={"latency_ms": 320},
    )
    print(results.passed, [r.score for r in results.individual])


asyncio.run(main())
```
`evaluate_assertions` never raises; individual assertion errors are
caught and turned into a failing `CheckResult` with the exception text in
the `reasoning` field.
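That fail-soft guarantee amounts to wrapping each check in a try/except that converts exceptions into failing results. A self-contained sketch, with a simplified stand-in for `CheckResult` (field names assumed from this page) and hypothetical helper names:

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    # Simplified stand-in for CheckLLM's result type.
    passed: bool
    score: float
    reasoning: str


def safe_evaluate(check, output: str) -> CheckResult:
    # Any exception becomes a failing result instead of propagating,
    # so one broken assertion cannot abort the whole eval run.
    try:
        return check(output)
    except Exception as exc:
        return CheckResult(passed=False, score=0.0, reasoning=str(exc))


def exploding_check(output: str) -> CheckResult:
    raise RuntimeError("judge timed out")


result = safe_evaluate(exploding_check, "some output")
print(result.passed, result.reasoning)  # → False judge timed out
```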
## Promptfoo migration table

| promptfoo | CheckLLM | Notes |
|---|---|---|
| `contains` | `contains` | Identical. |
| `not-contains` | `not-contains` | Identical. |
| `regex` | `regex` | Python `re.search` semantics. |
| `equals` | `equals` | Trimmed string compare. |
| `starts-with` | use `regex: "^..."` | Explicit regex is equivalent. |
| `ends-with` | use `regex: "...$"` | Explicit regex is equivalent. |
| `llm-rubric` | `llm-rubric` | Backed by `RubricMetric`. |
| `similar` | `similarity` | Levenshtein ratio, local, no judge call. |
| `answer-relevance` | `model-graded-relevance` | Backed by `RelevanceMetric`. |
| `factuality` | `model-graded-faithfulness` | Backed by `FaithfulnessMetric`. |
| `cost` | `cost` | Checks `judge.last_cost`. |
| `latency` | `latency` | Reads `context["latency_ms"]`. |
## How cost and latency work

- `cost` inspects `judge.last_cost`, which CheckLLM judges update on every
  `evaluate(...)` call. Place the assertion after any model-graded assertion
  in the same list so there is something to inspect; a leading `cost`
  assertion will compare against `0.0`.
- `latency` does not measure itself. It reads the observed latency from the
  `context` dict passed to `evaluate_assertions`. Your eval runner is
  expected to time the LLM call and pass `context={"latency_ms": measured}`.
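A typical runner-side timing pattern, sketched with `time.perf_counter` (`call_llm` is a stand-in for your real model call, not a CheckLLM API):

```python
import time


def call_llm(prompt: str) -> str:
    # Stand-in for the model call your eval runner actually makes.
    return "The capital of France is Paris."


start = time.perf_counter()
output = call_llm("What is the capital of France?")
latency_ms = (time.perf_counter() - start) * 1000

# Pass the measurement through the context dict so a `latency`
# assertion can compare it against its threshold.
context = {"latency_ms": latency_ms}
print(context["latency_ms"] >= 0)  # → True
```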
## Interoperability with `yaml_eval.py`

The existing `yaml_eval.AssertionConfig` type continues to work
unchanged. `yaml_assertions` is additive: new tests that prefer
promptfoo-style types can parse the `assert:` block through
`parse_assertions`, while older tests keep their explicit
`AssertionConfig` objects. The eval runner wires up both integration
points so the two APIs share the same judge and budget plumbing.
## Error handling

- Unknown `type` → `ValueError` at parse time.
- Missing required field (`rubric`, `prompt`, `query`, …) → `ValueError`
  at parse time with the offending index.
- Invalid regex pattern at eval time → failing `CheckResult` with the
  `re.error` message.
- Judge exceptions at eval time → failing `CheckResult` with the
  exception text.
This fail-loud-at-parse, fail-softly-at-eval policy keeps one broken assertion from tanking an entire eval run.