Advanced Patterns

Consensus Judging Strategies

A single judge can be unreliable. checkllm supports 7 consensus strategies that combine multiple judge backends into a single result.

When to use each strategy

Strategy           When to Use
majority_vote      Pass/fail decisions; use an odd number of judges
average            Continuous scores; balanced ensemble
weighted_average   Judges with different reliability or cost profiles
unanimous          Safety-critical: all judges must agree
any_pass           High-recall: pass if at least one judge approves
trimmed_mean       Automatically removes outlier scores
borda_count        Rank-based consensus for comparing multiple outputs

Setup

from checkllm import ConsensusJudge, OpenAIJudge, AnthropicJudge, GeminiJudge

judges = [
    OpenAIJudge(model="gpt-4o"),
    AnthropicJudge(model="claude-3-5-sonnet-20241022"),
    GeminiJudge(model="gemini-1.5-pro"),
]

majority_vote

consensus = ConsensusJudge(judges=judges, strategy="majority_vote")
result = consensus.score(metric="hallucination", output="Paris is in France.")
# Passes if >= 2/3 judges score >= threshold
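
The decision rule itself is easy to state. As a rough illustration in plain Python (the `majority_vote` helper below is hypothetical, not checkllm internals):

```python
# Illustrative sketch of the majority-vote rule: each judge casts a
# pass/fail vote against the threshold, and the consensus passes when
# more than half of the votes are passes.
def majority_vote(scores: list[float], threshold: float = 0.7) -> bool:
    votes = [score >= threshold for score in scores]
    return sum(votes) > len(votes) / 2

majority_vote([0.9, 0.8, 0.3])  # two of three judges pass -> True
```

With an even number of judges a 50/50 split fails, which is why the table above recommends an odd panel.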

weighted_average

consensus = ConsensusJudge(
    judges=judges,
    strategy="weighted_average",
    weights=[0.5, 0.35, 0.15],   # OpenAI carries 50% of the weight
)
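
The combination itself is a plain dot product of scores and weights. A minimal sketch, assuming the weights sum to 1.0 (the `weighted_average` helper is illustrative, not checkllm's implementation):

```python
# Sketch of the weighted combination: each judge's score is scaled by
# that judge's weight before summing.
def weighted_average(scores: list[float], weights: list[float]) -> float:
    return sum(score * weight for score, weight in zip(scores, weights))

weighted_average([0.9, 0.8, 0.6], [0.5, 0.35, 0.15])  # ~0.82
```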

unanimous (safety-critical)

consensus = ConsensusJudge(judges=judges, strategy="unanimous", threshold=0.9)
# ALL judges must score >= 0.9; a single disagreement fails the test

any_pass (high-recall)

consensus = ConsensusJudge(judges=judges, strategy="any_pass", threshold=0.7)
# Passes if at least one judge scores >= 0.7

trimmed_mean (outlier-resistant)

consensus = ConsensusJudge(
    judges=judges,
    strategy="trimmed_mean",
    trim_percent=0.2,   # Drop top and bottom 20% of scores
)
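
The underlying computation is worth seeing once: sort, drop the extremes, average the rest, so one wild judge cannot move the result much. A sketch (illustrative, and it assumes the list is long enough that something survives the trim):

```python
# Sketch of a trimmed mean: drop the lowest and highest trim_percent
# of the sorted scores, then average what remains.
def trimmed_mean(scores: list[float], trim_percent: float = 0.2) -> float:
    k = int(len(scores) * trim_percent)
    kept = sorted(scores)[k:len(scores) - k]
    return sum(kept) / len(kept)

trimmed_mean([0.2, 0.8, 0.85, 0.9, 1.0])  # drops 0.2 and 1.0 -> ~0.85
```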

borda_count (ranking multiple outputs)

from checkllm import ConsensusJudge, OpenAIJudge

consensus = ConsensusJudge(
    judges=[OpenAIJudge(model="gpt-4o"), OpenAIJudge(model="gpt-4o-mini")],
    strategy="borda_count",
)

outputs = ["Summary A", "Summary B", "Summary C"]
ranked = consensus.rank(metric="summarization", outputs=outputs)
print(ranked[0])   # Highest-ranked output
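
Borda counting turns each judge's ranking into points and sums them. A sketch of the aggregation (the `borda_rank` helper is illustrative, not checkllm's implementation):

```python
# Sketch of Borda counting: with n outputs, each judge awards
# (n - position) points down its ranking; point totals decide the order.
def borda_rank(rankings: list[list[str]]) -> list[str]:
    n = len(rankings[0])
    points: dict[str, int] = {}
    for ranking in rankings:
        for position, output in enumerate(ranking):
            points[output] = points.get(output, 0) + (n - position)
    return sorted(points, key=lambda output: points[output], reverse=True)

# Judges rank A>B>C, B>A>C, A>C>B; A wins with 8 points.
borda_rank([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])
```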

Prompt Optimization Algorithms

checkllm includes 4 optimization algorithms that improve prompts automatically.

When to use each

Algorithm   Best For                                      Cost
genetic     Large search spaces; no labelled data needed  Low
copro       Few-shot example selection                    Medium
mipro       Joint instruction + few-shot optimization     High
simba       Multi-step chain / agent prompt optimization  High

Genetic Algorithm

from checkllm import optimize, GeneticOptimizer

@optimize(
    optimizer=GeneticOptimizer(
        population_size=20,
        generations=10,
        mutation_rate=0.2,
    ),
    metric="answer_relevance",
    threshold=0.85,
)
def summarize(text: str) -> str:
    return f"Summarize this: {text}"

best_prompt = summarize.optimize(training_data=my_examples)
print(best_prompt)
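
A genetic optimizer follows a score-select-mutate loop. The toy sketch below shows that loop in plain Python; the `evolve` helper and its suffix mutations are made up for illustration, and a real run would score candidates with the judge-backed metric rather than a toy fitness function:

```python
import random

def evolve(seed: str, fitness, generations: int = 10,
           population_size: int = 20, mutation_rate: float = 0.2) -> str:
    suffixes = [" Be concise.", " Cite sources.", " Think step by step."]

    def mutate(prompt: str) -> str:
        # With probability mutation_rate, append a random instruction.
        if random.random() < mutation_rate:
            return prompt + random.choice(suffixes)
        return prompt

    population = [mutate(seed) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the fittest half, refill with mutated copies of survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        population = survivors + [mutate(p) for p in survivors]
    return max(population, key=fitness)

best = evolve("Summarize this:", fitness=len)  # toy fitness: prompt length
```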

COPRO (few-shot selection)

from checkllm import optimize, COPROOptimizer

@optimize(
    optimizer=COPROOptimizer(
        breadth=10,             # Candidate prompts per round
        depth=3,                # Refinement rounds
        examples=my_examples,   # list[{"input": ..., "output": ...}]
    ),
    metric="answer_correctness",
)
def answer(question: str) -> str:
    return f"Answer: {question}"
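
The `examples` argument expects labelled input/output pairs in the shape the inline comment describes. A made-up illustration of that structure:

```python
# Hypothetical training examples in the {"input": ..., "output": ...} shape.
my_examples = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is 2 + 2?", "output": "4"},
]
```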

MIPROv2 (instruction + few-shot)

from checkllm import optimize, MIPROOptimizer

@optimize(
    optimizer=MIPROOptimizer(
        num_candidates=10,
        num_trials=20,
        max_bootstrapped_demos=4,
    ),
    metric="hallucination",
)
def rag_answer(question: str, context: list[str]) -> str:
    ...

SIMBA (multi-step chains)

from checkllm import optimize, SIMBAOptimizer

@optimize(
    optimizer=SIMBAOptimizer(
        num_candidates=16,
        max_steps=4,
    ),
    metric="trajectory_goal_completion",
)
def multi_step_agent(task: str) -> str:
    ...

Cost Workflows

Hard budget caps

[tool.checkllm]
budget = 5.0   # Stop after $5.00 spent

[tool.checkllm.profiles.ci]
budget = 10.0

Or per pytest run:

pytest tests/ --checkllm-budget 5.0

Estimate before running

checkllm estimate tests/

# Estimated cost: $2.34  (47 tests x avg $0.05/test)
# Breakdown:
#   hallucination x 20      $1.20
#   answer_relevance x 15   $0.75
#   summarization x 12      $0.39

Response caching

[tool.checkllm]
cache_enabled = true
cache_dir     = ".checkllm/cache"
cache_ttl     = 86400   # 24 hours

Manage the cache from the CLI:

checkllm cache clear                  # Wipe entire cache
checkllm cache clear --older-than 7d  # Remove stale entries only

Use cheaper models in development

[tool.checkllm.profiles.dev]
judge_model = "gpt-4o-mini"          # ~15x cheaper than gpt-4o
budget      = 1.0

[tool.checkllm.profiles.prod]
judge_model = "gpt-4o-2024-11-20"   # Pinned version
budget      = 50.0

Select a profile per run with the CHECKLLM_PROFILE environment variable:

CHECKLLM_PROFILE=dev  pytest tests/ -v   # development
CHECKLLM_PROFILE=prod pytest tests/ -v   # production

Cost report in CI

- name: Run evaluations
  run: checkllm ci --budget 10.0 --fail-on-regression
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Cost summary
  if: always()
  run: checkllm report --format markdown >> $GITHUB_STEP_SUMMARY