DAG Metric¶
CheckLLM's DAGMetric evaluates an LLM output by running it through a
directed acyclic graph of judge nodes. Every node is a single
LLM-as-judge call, and edges choose the next node based on the current
node's verdict. This is the tool of choice when a single rubric is not
enough — when downstream checks depend on upstream outcomes.
When to use the DAG metric (vs simple metrics)¶
Reach for a plain metric (FaithfulnessMetric, CorrectnessMetric,
GEval, ...) when you want a single scalar answer to a single
question. Reach for DAGMetric when any of the following is true:
- Different checks need to run based on what the previous check found (for example, only run "remediation analysis" when correctness is low).
- Later nodes should see earlier results (parent score / reasoning).
- You want a single CheckResult that reflects a multi-step reasoning policy, not an unweighted average of independent metrics.
- You want to short-circuit expensive judges when cheaper gates fail.
Quickstart¶
```python
import asyncio

from checkllm.judge import OpenAIJudge
from checkllm.metrics.dag import DAGMetric, DAGNode


async def main() -> None:
    judge = OpenAIJudge(model="gpt-4o-mini")
    dag = DAGMetric(
        judge=judge,
        nodes=[
            DAGNode(
                name="safety",
                prompt_template="Is this output safe? {output}",
                threshold=0.8,
                children_on_pass=["quality"],
            ),
            DAGNode(
                name="quality",
                prompt_template="Rate quality 0-1: {output}",
                threshold=0.7,
                is_leaf=True,
            ),
        ],
        root="safety",
    )
    result = await dag.evaluate(output="Hello, world!")
    print(result.score, result.passed)


asyncio.run(main())
```
The metric validates the graph at construction time. You'll get a
ValueError immediately if the root is missing, a child name is
unknown, a threshold is outside [0, 1], or the graph has a cycle.
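As a rough mental model of those checks, here is an illustrative sketch (not CheckLLM's actual code) of a construction-time validator that rejects a missing root, unknown child references, out-of-range thresholds, and cycles:

```python
def validate(nodes: dict, root: str) -> None:
    """Illustrative validator. `nodes` maps name -> {"threshold": float, "children": [names]}."""
    if root not in nodes:
        raise ValueError(f"root node {root!r} is not defined")
    for name, node in nodes.items():
        if not 0.0 <= node["threshold"] <= 1.0:
            raise ValueError(f"{name}: threshold must be in [0, 1]")
        for child in node["children"]:
            if child not in nodes:
                raise ValueError(f"{name}: unknown child {child!r}")

    # Cycle check: depth-first search with a recursion stack.
    visiting, done = set(), set()

    def dfs(name: str) -> None:
        if name in visiting:
            raise ValueError(f"cycle detected at {name!r}")
        if name in done:
            return
        visiting.add(name)
        for child in nodes[name]["children"]:
            dfs(child)
        visiting.remove(name)
        done.add(name)

    dfs(root)
```

Failing fast here is cheap; failing at evaluation time would waste judge calls.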
Conditional branching¶
There are two ways to branch from a node.
Pass / fail — the simplest. Set threshold, children_on_pass,
children_on_fail. Children referenced in children_on_pass run when
score >= threshold; otherwise children_on_fail run.
Score ranges — use children_on_score_ranges when you need more
than a binary split. Keys are (lo, hi) tuples (inclusive on lo,
exclusive on hi, except the bucket containing 1.0 includes the
upper bound). When both children_on_pass/children_on_fail and
children_on_score_ranges are set, the ranges win.
```python
DAGNode(
    name="correctness",
    prompt_template="Is this correct? {output}",
    children_on_score_ranges={
        (0.0, 0.5): ["remediate"],
        (0.5, 0.8): ["improve"],
        (0.8, 1.0): ["polish"],
    },
)
```
Context passing between nodes¶
Prompts support the following placeholders:

| Placeholder | Resolves to |
|---|---|
| {output} | The string passed to evaluate() |
| {parent_score} | Previous node's score (formatted to four decimals) |
| {parent_reasoning} | Previous node's free-text reasoning |
| {context.<key>} | A lookup in the context dict |
| {context} | The whole context dict as a JSON blob |
```python
result = await dag.evaluate(
    output=llm_output,
    context={"spec": "Write a pure add(a, b)", "language": "python"},
)
```
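A minimal sketch of how those placeholders could be resolved before the judge call (assumed behaviour for illustration, not CheckLLM's implementation):

```python
import json


def render_prompt(template, output, parent=None, context=None):
    """Substitute the documented placeholders into a prompt template."""
    context = context or {}
    text = template.replace("{output}", output)
    if parent is not None:
        text = text.replace("{parent_score}", f"{parent['score']:.4f}")
        text = text.replace("{parent_reasoning}", parent["reasoning"])
    # Resolve {context.<key>} lookups before the whole-dict {context} form.
    for key, value in context.items():
        text = text.replace("{context." + key + "}", str(value))
    return text.replace("{context}", json.dumps(context))
```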
Leaf nodes and final verdicts¶
By default, the DAG's score is the weighted average of every visited
node. Set is_leaf=True on any node to mark it as a terminal verdict
— when the traversal reaches a leaf, that node's score becomes the
final DAG score on its own (the weighted average is ignored). This is
useful for flows where a specialised final judge should have the last
word.
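The scoring rule reads roughly like this sketch (illustrative, with an assumed per-node weight attribute): a visited leaf's score wins outright, otherwise the final score is the weighted average over visited nodes.

```python
def final_score(visited):
    """visited: ordered list of dicts with 'score', 'weight', 'is_leaf' keys."""
    for node in visited:
        if node["is_leaf"]:
            return node["score"]  # terminal verdict: the weighted average is ignored
    total = sum(n["weight"] for n in visited)
    return sum(n["score"] * n["weight"] for n in visited) / total
```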
Sibling children of a non-leaf node run concurrently via
asyncio.gather, so wide graphs stay fast. Use
DAGMetric.aevaluate_batch(outputs) to evaluate many outputs in
parallel.
After any evaluate() call, inspect dag.get_last_path() for the
ordered list of DAGEvalResult nodes that were visited. The same
trace is serialised into result.reasoning under a [dag-trace] {...}
prefix, so it survives JSON logs and CI artefact dumps.
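If you want the trace back out of a logged reasoning string, a hedged sketch like the following works, assuming the trace is a JSON object immediately after a literal "[dag-trace] " prefix on the first line (the exact serialisation is up to the library):

```python
import json


def parse_dag_trace(reasoning: str):
    """Return the trace dict if present, else None. Assumes a first-line prefix."""
    prefix = "[dag-trace] "
    first_line, _, _ = reasoning.partition("\n")
    if not first_line.startswith(prefix):
        return None
    return json.loads(first_line[len(prefix):])
```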
Visualizing your graph (Mermaid)¶
Every DAGMetric can print itself as a Mermaid flowchart via
dag.to_mermaid(). Drop the result into a Markdown document or use it
to review complex graphs in code review.
```mermaid
flowchart TD
    safety["safety"]
    correctness["correctness"]
    remediation_analysis["remediation_analysis (leaf)"]
    style["style (leaf)"]
    safety_reject["safety_reject (leaf)"]
    safety -- pass --> correctness
    safety -- fail --> safety_reject
    correctness -- "[0.00,0.50)" --> remediation_analysis
    correctness -- "[0.50,1.00)" --> style
    classDef leaf fill:#e0f7e9,stroke:#2e7d32;
    class remediation_analysis,style,safety_reject leaf;
```
Comparison vs DeepEval DAG¶
| Feature | DeepEval DAGMetric | CheckLLM DAGMetric |
|---|---|---|
| Judge nodes with thresholds | Yes | Yes |
| Pass/fail branching | Yes | Yes |
| Score-range branching | Limited (binary verdict nodes) | Native children_on_score_ranges |
| Parent score / reasoning in child prompt | Manual plumbing | {parent_score} / {parent_reasoning} placeholders |
| User-supplied context | Manual | context={...} kwarg + {context.*} placeholders |
| Parallel sibling execution | Sequential | asyncio.gather under the hood |
| Construction-time validation | Runtime errors | Immediate ValueError (cycles, bad refs, bad thresholds) |
| Graph visualization | Not built in | to_mermaid() |
| Batch evaluation | Loop yourself | aevaluate_batch(outputs) |
| Path trace on result | Implicit | get_last_path() + [dag-trace] JSON prefix in reasoning |
The intent is full parity on the DAG mental model, with better ergonomics for the things teams actually do in production: inject rubric-level context, visualise the graph, and batch evaluate across a dataset.