DAG Metric¶
CheckLLM's DAGMetric evaluates an LLM output by running it through a
directed acyclic graph of judge nodes. Every node is a single
LLM-as-judge call, and edges choose the next node based on the current
node's verdict. This is the tool of choice when a single rubric is not
enough — when downstream checks depend on upstream outcomes.
When to use the DAG metric (vs simple metrics)¶
Reach for a plain metric (FaithfulnessMetric, CorrectnessMetric,
GEval, ...) when you want a single scalar answer to a single
question. Reach for DAGMetric when any of the following is true:
- Different checks need to run based on what the previous check found (for example, only run "remediation analysis" when correctness is low).
- Later nodes should see earlier results (parent score / reasoning).
- You want a single CheckResult that reflects a multi-step reasoning policy, not an unweighted average of independent metrics.
- You want to short-circuit expensive judges when cheaper gates fail.
Quickstart¶
```python
import asyncio

from checkllm.judge import OpenAIJudge
from checkllm.metrics.dag import DAGMetric, DAGNode


async def main() -> None:
    judge = OpenAIJudge(model="gpt-4o-mini")
    dag = DAGMetric(
        judge=judge,
        nodes=[
            DAGNode(
                name="safety",
                prompt_template="Is this output safe? {output}",
                threshold=0.8,
                children_on_pass=["quality"],
            ),
            DAGNode(
                name="quality",
                prompt_template="Rate quality 0-1: {output}",
                threshold=0.7,
                is_leaf=True,
            ),
        ],
        root="safety",
    )
    result = await dag.evaluate(output="Hello, world!")
    print(result.score, result.passed)


asyncio.run(main())
```
The metric validates the graph at construction time. You'll get a
ValueError immediately if the root is missing, a child name is
unknown, a threshold is outside [0, 1], or the graph has a cycle.
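As a rough mental model of those checks, here is an illustrative sketch (not CheckLLM's actual code) of a construction-time validator that rejects a missing root, unknown child references, out-of-range thresholds, and cycles:

```python
def validate(nodes: dict, root: str) -> None:
    """Illustrative validator. `nodes` maps name -> {"threshold": float, "children": [names]}."""
    if root not in nodes:
        raise ValueError(f"root node {root!r} is not defined")
    for name, node in nodes.items():
        if not 0.0 <= node["threshold"] <= 1.0:
            raise ValueError(f"{name}: threshold must be in [0, 1]")
        for child in node["children"]:
            if child not in nodes:
                raise ValueError(f"{name}: unknown child {child!r}")

    # Cycle check: depth-first search with a recursion stack.
    visiting, done = set(), set()

    def dfs(name: str) -> None:
        if name in visiting:
            raise ValueError(f"cycle detected at {name!r}")
        if name in done:
            return
        visiting.add(name)
        for child in nodes[name]["children"]:
            dfs(child)
        visiting.remove(name)
        done.add(name)

    dfs(root)
```

Failing fast here is cheap; failing at evaluation time would waste judge calls.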
Conditional branching¶
There are two ways to branch from a node.
Pass / fail — the simplest. Set threshold, children_on_pass,
children_on_fail. Children referenced in children_on_pass run when
score >= threshold; otherwise children_on_fail run.
Score ranges — use children_on_score_ranges when you need more
than a binary split. Keys are (lo, hi) tuples (inclusive on lo,
exclusive on hi, except the bucket containing 1.0 includes the
upper bound). When both children_on_pass/children_on_fail and
children_on_score_ranges are set, the ranges win.
```python
DAGNode(
    name="correctness",
    prompt_template="Is this correct? {output}",
    children_on_score_ranges={
        (0.0, 0.5): ["remediate"],
        (0.5, 0.8): ["improve"],
        (0.8, 1.0): ["polish"],
    },
)
```
Context passing between nodes¶
Prompts support the following placeholders:

| Placeholder | Resolves to |
|---|---|
| {output} | The string passed to evaluate() |
| {parent_score} | Previous node's score (formatted to four decimals) |
| {parent_reasoning} | Previous node's free-text reasoning |
| {context.<key>} | A lookup in the context dict |
| {context} | The whole context dict as a JSON blob |
```python
result = await dag.evaluate(
    output=llm_output,
    context={"spec": "Write a pure add(a, b)", "language": "python"},
)
```
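A minimal sketch of how those placeholders could be resolved before the judge call (assumed behaviour for illustration, not CheckLLM's implementation):

```python
import json


def render_prompt(template, output, parent=None, context=None):
    """Substitute the documented placeholders into a prompt template."""
    context = context or {}
    text = template.replace("{output}", output)
    if parent is not None:
        text = text.replace("{parent_score}", f"{parent['score']:.4f}")
        text = text.replace("{parent_reasoning}", parent["reasoning"])
    # Resolve {context.<key>} lookups before the whole-dict {context} form.
    for key, value in context.items():
        text = text.replace("{context." + key + "}", str(value))
    return text.replace("{context}", json.dumps(context))
```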
Leaf nodes and final verdicts¶
By default, the DAG's score is the weighted average of every visited
node. Set is_leaf=True on any node to mark it as a terminal verdict
— when the traversal reaches a leaf, that node's score becomes the
final DAG score on its own (the weighted average is ignored). This is
useful for flows where a specialised final judge should have the last
word.
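The scoring rule reads roughly like this sketch (illustrative, with an assumed per-node weight attribute): a visited leaf's score wins outright, otherwise the final score is the weighted average over visited nodes.

```python
def final_score(visited):
    """visited: ordered list of dicts with 'score', 'weight', 'is_leaf' keys."""
    for node in visited:
        if node["is_leaf"]:
            return node["score"]  # terminal verdict: the weighted average is ignored
    total = sum(n["weight"] for n in visited)
    return sum(n["score"] * n["weight"] for n in visited) / total
```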
Sibling children of a non-leaf node run concurrently via
asyncio.gather, so wide graphs stay fast. Use
DAGMetric.aevaluate_batch(outputs) to evaluate many outputs in
parallel.
After any evaluate() call, inspect dag.get_last_path() for the
ordered list of DAGEvalResult nodes that were visited. The same
trace is serialised into result.reasoning under a [dag-trace] {...}
prefix, so it survives JSON logs and CI artefact dumps.
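If you want the trace back out of a logged reasoning string, a hedged sketch like the following works, assuming the trace is a JSON object immediately after a literal "[dag-trace] " prefix on the first line (the exact serialisation is up to the library):

```python
import json


def parse_dag_trace(reasoning: str):
    """Return the trace dict if present, else None. Assumes a first-line prefix."""
    prefix = "[dag-trace] "
    first_line, _, _ = reasoning.partition("\n")
    if not first_line.startswith(prefix):
        return None
    return json.loads(first_line[len(prefix):])
```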
Visualizing your graph (Mermaid)¶
Every DAGMetric can print itself as a Mermaid flowchart via
dag.to_mermaid(). Drop the result into a Markdown document or use it
to review complex graphs in code review.
```mermaid
flowchart TD
    safety["safety"]
    correctness["correctness"]
    remediation_analysis["remediation_analysis (leaf)"]
    style["style (leaf)"]
    safety_reject["safety_reject (leaf)"]
    safety -- pass --> correctness
    safety -- fail --> safety_reject
    correctness -- "[0.00,0.50)" --> remediation_analysis
    correctness -- "[0.50,1.00)" --> style
    classDef leaf fill:#e0f7e9,stroke:#2e7d32;
    class remediation_analysis,style,safety_reject leaf;
```
Comparison vs DeepEval DAG¶
| Feature | DeepEval DAGMetric | CheckLLM DAGMetric |
|---|---|---|
| Judge nodes with thresholds | Yes | Yes |
| Pass/fail branching | Yes | Yes |
| Score-range branching | Limited (binary verdict nodes) | Native children_on_score_ranges |
| Parent score / reasoning in child prompt | Manual plumbing | {parent_score} / {parent_reasoning} placeholders |
| User-supplied context | Manual | context={...} kwarg + {context.*} placeholders |
| Parallel sibling execution | Sequential | asyncio.gather under the hood |
| Construction-time validation | Runtime errors | Immediate ValueError (cycles, bad refs, bad thresholds) |
| Graph visualization | Not built in | to_mermaid() |
| Batch evaluation | Loop yourself | aevaluate_batch(outputs) |
| Path trace on result | Implicit | get_last_path() + [dag-trace] JSON prefix in reasoning |
The intent is full parity on the DAG mental model, with better ergonomics for the things teams actually do in production: inject rubric-level context, visualise the graph, and batch evaluate across a dataset.