v5.1.0 New APIs

This page covers the public APIs added in v5.1.0. For the full release notes see CHANGELOG.md.

Retrieval ranking metrics

Classical IR ranking metrics for RAG retrieval evaluation. All are pure Python, deterministic, and require no judge call.

from checkllm.metrics import (
    NDCG, MRR, MAPAtK, PrecisionAtK, RecallAtK, HitRateAtK,
)

ndcg = NDCG(k=10)
await ndcg.evaluate(retrieved=["doc1", "doc7", "doc3"], relevant={"doc1", "doc3"})

Each returns a score in [0, 1].
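
For readers who want to sanity-check scores by hand, the following is a minimal standalone sketch of three of these metrics under a binary-relevance assumption (a document is either relevant or not). The function names are hypothetical and the formulas are the textbook ones; checkllm's own implementation may differ in details such as graded relevance or how duplicate documents are handled.

```python
import math
from typing import Sequence, Set, Tuple


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG.

    Assumes `retrieved` contains no duplicate document IDs.
    """
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved[:k]]
    # Rank 1 is discounted by log2(2), rank 2 by log2(3), and so on.
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    # The ideal ranking puts every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@k (hits / k) and Recall@k (hits / number of relevant docs)."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With the inputs from the example above (doc1 and doc3 relevant, retrieved at ranks 1 and 3), this sketch gives an NDCG of roughly 0.92 and a reciprocal rank of 1.0.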

Native Vertex AI judge

GCP enterprises can now use Vertex AI without the LiteLLM shim.

from checkllm.providers import VertexAIJudge, create_judge

judge = VertexAIJudge(
    model="gemini-2.0-flash",
    project="my-gcp-project",   # or $GOOGLE_CLOUD_PROJECT
    location="us-central1",      # or $GOOGLE_CLOUD_LOCATION
)
# or:
judge = create_judge("vertex", model="gemini-2.0-flash", project="my-gcp-project")

Install with pip install checkllm[vertex]. Uses Application Default Credentials unless credentials= is passed explicitly.

Per-provider rate limiting

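Each provider gets two token buckets, one for requests per minute (RPM) and one for tokens per minute (TPM), plus retry logic that honors HTTP 429 responses and the Retry-After header.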

from checkllm import (
    ProviderRateLimiter, RateLimit, RetryConfig,
    TokenBucket, retry_with_backoff,
)

limiter = ProviderRateLimiter(limits={
    "openai": RateLimit(rpm=500, tpm=30_000),
    "anthropic": RateLimit(rpm=50, tpm=20_000),
})
await limiter.acquire("openai", est_tokens=1500)
# ... make call, then:
limiter.release_actual("openai", actual_tokens=1420)

AsyncEngine.submit_judge(...) wraps this automatically. Configure via CheckllmConfig.rate_limits.
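
To illustrate the idea behind the dual buckets, here is a small synchronous sketch, not checkllm's implementation. The class names (Bucket, DualLimiter), the try_acquire method, and the injectable clock are all hypothetical; the sketch only shows how a request is admitted when both the request bucket and the token bucket have capacity, and how an over-estimate can be refunded afterwards.

```python
import time
from typing import Callable


class Bucket:
    """Token bucket that refills continuously at `capacity` units per minute."""

    def __init__(self, capacity: float, now: Callable[[], float] = time.monotonic):
        self.capacity = float(capacity)
        self.rate = self.capacity / 60.0  # units regained per second
        self.tokens = self.capacity       # start full
        self._now = now
        self._last = now()

    def _refill(self) -> None:
        t = self._now()
        self.tokens = min(self.capacity, self.tokens + (t - self._last) * self.rate)
        self._last = t

    def wait_time(self, amount: float) -> float:
        """Seconds until `amount` units are available (0.0 if available now)."""
        self._refill()
        return max(0.0, (amount - self.tokens) / self.rate)

    def take(self, amount: float) -> None:
        self._refill()
        self.tokens -= amount

    def refund(self, amount: float) -> None:
        self._refill()
        self.tokens = min(self.capacity, self.tokens + amount)


class DualLimiter:
    """Admits a call only when both the RPM and the TPM bucket allow it."""

    def __init__(self, rpm: float, tpm: float, now: Callable[[], float] = time.monotonic):
        self.requests = Bucket(rpm, now)
        self.tokens = Bucket(tpm, now)

    def try_acquire(self, est_tokens: float) -> float:
        """Reserve capacity and return 0.0, or return the seconds to wait."""
        wait = max(self.requests.wait_time(1), self.tokens.wait_time(est_tokens))
        if wait > 0:
            return wait  # nothing is reserved on failure
        self.requests.take(1)
        self.tokens.take(est_tokens)
        return 0.0

    def release_actual(self, est_tokens: float, actual_tokens: float) -> None:
        """Hand back the unused part of the estimate to the TPM bucket."""
        if actual_tokens < est_tokens:
            self.tokens.refund(est_tokens - actual_tokens)
```

A caller would sleep for the returned wait time and try again. An estimate larger than the bucket capacity can never be satisfied in this sketch, so a real limiter would need to reject or clamp such requests.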

W3C trace-context propagation

propagate_trace_context() injects W3C traceparent / tracestate headers into every outbound judge HTTP call, so a single evaluation appears as one trace across service boundaries.

from checkllm import propagate_trace_context

with tracer.span("rag-eval"):
    # Every judge call in this block carries the active traceparent.
    await check.that(output).is_faithful_to(context)

Install with pip install checkllm[otel].
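
For context, a version-00 traceparent header has four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent (span) ID, and 2-character flags. The sketch below parses such a header and derives headers for a child call that keeps the trace ID and uses a fresh span ID. It is a standalone illustration of the header format with hypothetical helper names, not a description of how checkllm builds the header internally.

```python
import re
import secrets
from typing import Dict, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[Dict[str, object]]:
    """Parse a version-00 traceparent value; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_headers(parent_header: str) -> Dict[str, str]:
    """Headers for an outbound call: same trace ID, freshly generated span ID."""
    parent = parse_traceparent(parent_header)
    if parent is None:
        raise ValueError("invalid traceparent header")
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return {"traceparent": f"00-{parent['trace_id']}-{span_id}-{flags}"}
```

Because every outbound call shares the trace ID and differs only in span ID, a tracing backend can stitch the calls into one trace.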

Anthropic streaming + batch API

from checkllm.streaming import StreamingEvaluator
from checkllm import get_batch_runner, AnthropicBatchRunner

evaluator = StreamingEvaluator(metrics=[...])
async for chunk in evaluator.evaluate_provider("anthropic", prompt, model="claude-sonnet-4-5"):
    ...

runner = get_batch_runner("anthropic", model="claude-sonnet-4-5")
job = await runner.submit(requests)
job = await runner.poll(job)
responses = await runner.retrieve(job)  # 50% batch discount auto-applied

CLI: checkllm batch --batch anthropic --dataset eval.yaml.

Cost attribution rollups

Every CheckResult now carries a CostBreakdown. New dashboard endpoints aggregate across providers, metrics, and tests.

GET /api/cost/by-provider
GET /api/cost/by-metric
GET /api/cost/by-test
GET /api/cost/timeseries?bucket=hour|day

The pricing table lives in checkllm.pricing and ships with a 2026-04 snapshot for OpenAI, Anthropic, Gemini, DeepSeek, and Bedrock variants.
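
The rollups these endpoints describe amount to grouping cost records by a key or by a time bucket. The sketch below shows that aggregation on plain dictionaries. The record fields (provider, metric, cost_usd, ts) and the function names are assumptions made for illustration; they are not the CostBreakdown schema or the endpoint response format.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Mapping


def rollup(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Sum cost_usd per distinct value of `key` (e.g. "provider" or "metric")."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["cost_usd"]
    return dict(totals)


def timeseries(records: Iterable[Mapping], bucket: str = "hour") -> Dict[str, float]:
    """Sum cost_usd per UTC hour or day; `ts` is a Unix timestamp in seconds."""
    formats = {"hour": "%Y-%m-%dT%H:00Z", "day": "%Y-%m-%d"}
    if bucket not in formats:
        raise ValueError("bucket must be 'hour' or 'day'")
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        stamp = datetime.fromtimestamp(rec["ts"], tz=timezone.utc)
        totals[stamp.strftime(formats[bucket])] += rec["cost_usd"]
    return dict(sorted(totals.items()))


# Hypothetical records with made-up costs, purely for illustration.
records = [
    {"provider": "openai", "metric": "faithfulness", "cost_usd": 0.25, "ts": 0},
    {"provider": "anthropic", "metric": "faithfulness", "cost_usd": 0.5, "ts": 1800},
    {"provider": "openai", "metric": "relevance", "cost_usd": 0.125, "ts": 3600},
]
by_provider = rollup(records, "provider")
hourly = timeseries(records, bucket="hour")
```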

Live progress dashboard

from checkllm.dashboard_ws import run

run(host="127.0.0.1", port=8485)
# Browse http://localhost:8485/live
# Or subscribe to ws://localhost:8485/ws/progress

For non-loopback deployments, pass token="<shared-secret>". The /live page and the WebSocket upgrade then both require ?token=<value>.
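
As a rough illustration of what a shared-secret check on a ?token= query parameter can look like, the sketch below extracts the token from a request URL and compares it in constant time. The helper name token_ok is hypothetical and this is not the dashboard's actual code; it only shows the general technique.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def token_ok(request_url: str, expected: str) -> bool:
    """Return True if the URL's ?token= value matches the shared secret."""
    query = parse_qs(urlsplit(request_url).query)
    supplied = query.get("token", [""])[0]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Since the token travels in the URL, it can end up in access logs and browser history, so serving the dashboard over TLS and rotating the secret are sensible precautions.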

Check registry parity

The new @check decorator mirrors @metric, and the composition primitives AllOf, AnyOf, and Not combine registered checks by name.

from checkllm import check, AllOf, AnyOf, Not, run_check, CHECK_REGISTRY

@check("has_greeting", tags=("format",))
def has_greeting(output: str) -> CheckResult: ...

combo = AllOf("has_greeting", "max_length_2000", Not("contains_pii"))
result = combo(output="Hello world")

All 39 built-in deterministic checks are auto-registered.
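
The composition idea can be sketched in a few lines of plain Python. Everything below (REGISTRY, register, all_of, any_of, negate) is a hypothetical stand-in that returns booleans, whereas the real primitives work with CheckResult objects; the point is only to show how named checks resolve through a registry and combine with boolean logic.

```python
from typing import Callable, Dict, Union

Check = Callable[..., bool]
REGISTRY: Dict[str, Check] = {}


def register(name: str) -> Callable[[Check], Check]:
    """Decorator that stores a check in the registry under `name`."""
    def decorator(fn: Check) -> Check:
        REGISTRY[name] = fn
        return fn
    return decorator


def _resolve(check: Union[str, Check]) -> Check:
    # Strings are looked up lazily, so checks can be registered later.
    return REGISTRY[check] if isinstance(check, str) else check


def all_of(*checks: Union[str, Check]) -> Check:
    return lambda output: all(_resolve(c)(output=output) for c in checks)


def any_of(*checks: Union[str, Check]) -> Check:
    return lambda output: any(_resolve(c)(output=output) for c in checks)


def negate(check: Union[str, Check]) -> Check:
    return lambda output: not _resolve(check)(output=output)


@register("has_greeting")
def has_greeting(output: str) -> bool:
    return output.lower().startswith(("hello", "hi"))


@register("max_length_2000")
def max_length_2000(output: str) -> bool:
    return len(output) <= 2000


@register("contains_pii")
def contains_pii(output: str) -> bool:
    return "@" in output  # deliberately naive placeholder


combo = all_of("has_greeting", "max_length_2000", negate("contains_pii"))
```

Resolving names at call time rather than at construction time keeps composition independent of import order, which is one plausible reason to reference checks by string.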

Agent trajectory metrics

from checkllm.metrics import (
    ToolParameterAccuracyMetric,
    ToolSelectionAccuracyMetric,
    TrajectoryMetric,
)
from checkllm.agents import ToolCallTrace, traces_from_test_case

TrajectoryMetric scores step ordering (Levenshtein distance vs expected), loop detection, expected-tool coverage, and unexpected-tool penalties.
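
To make the step-ordering component concrete, here is a minimal sketch that computes the Levenshtein distance between two tool-name sequences and normalizes it into a [0, 1] score. The normalization (one minus distance over the longer sequence length) and the function names are assumptions; TrajectoryMetric may weight or combine its components differently.

```python
from typing import Sequence


def edit_distance(actual: Sequence[str], expected: Sequence[str]) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(expected) + 1))
    for i, a in enumerate(actual, start=1):
        curr = [i]
        for j, e in enumerate(expected, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a step from `actual`
                curr[j - 1] + 1,         # insert a missing expected step
                prev[j - 1] + (a != e),  # substitute, free if the steps match
            ))
        prev = curr
    return prev[-1]


def ordering_score(actual: Sequence[str], expected: Sequence[str]) -> float:
    """1.0 for an identical trajectory, approaching 0.0 as the edits grow."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(actual, expected) / longest
```

For example, an agent that calls search and then summarize when search, fetch, summarize was expected is one insertion away, giving a score of about 0.67 under this normalization.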

Red-team scorecards

from checkllm.redteam_scorecard import (
    ExploitSuccessRate,
    OWASPTop10LLMScorecard,
    SensitiveDataExposureRate,
    generate_redteam_report,
)

report = generate_redteam_report(attack_run)
print(report.as_dict())

Experiment analysis

from checkllm.analysis.correlation import pairwise_correlations
from checkllm.analysis.significance import welch_t, mann_whitney_u, bootstrap_ci

corrs = pairwise_correlations(snapshot)
sig = welch_t(run_a_scores, run_b_scores)   # p, Cohen's d, 95% CI
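
For reference, the quantities behind a Welch comparison can be computed with the standard library alone. The sketch below returns the t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d (using the average of the two sample variances as the pooled variance, which is one common convention). The function name is hypothetical and the sketch omits the p-value and confidence interval, which need a t-distribution CDF; it is not checkllm's welch_t.

```python
import math
from statistics import fmean, variance
from typing import Sequence, Tuple


def welch_sketch(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (t, degrees_of_freedom, cohens_d) for two independent samples.

    Requires at least two observations per sample and non-zero variance.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = fmean(a), fmean(b)
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1)

    # Standard error without assuming equal variances.
    se_sq = var_a / na + var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se_sq)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se_sq ** 2 / (
        (var_a / na) ** 2 / (na - 1) + (var_b / nb) ** 2 / (nb - 1)
    )

    # Effect size with the mean of the two variances as pooled variance.
    d = (mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)
    return t, df, d
```

With scipy installed, scipy.stats.ttest_ind(a, b, equal_var=False) offers a way to cross-check the t statistic and obtain a p-value.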

Vector-store integrations

from checkllm.integrations.pinecone import connect as pinecone_connect
from checkllm.integrations.weaviate import connect as weaviate_connect
from checkllm.integrations.milvus import connect as milvus_connect
from checkllm.integrations.chroma import connect as chroma_connect
from checkllm.metrics import KBFaithfulnessMetric
from checkllm.audits.vectordb_freshness import FreshnessAudit

Install with pip install checkllm[vectorstores].

Judge drift detection

from checkllm.drift import JudgeBaseline, record_baseline_sync, detect_drift_sync

baseline = record_baseline_sync(judge)
baseline.save("baseline.json")

report = detect_drift_sync(judge, JudgeBaseline.load("baseline.json"))
if report.drifted:
    ...

CLI:

checkllm drift baseline openai --model gpt-4o-mini
checkllm drift check openai --baseline ./baseline.json

Additional benchmarks

New loaders in checkllm.benchmarks:

  • squad_v2 — open-domain QA with abstention
  • arc_challenge — multi-step reasoning
  • bbh_hard — BIG-Bench Hard subsets
  • drop_reading — reading comprehension
  • cnn_dailymail — summarization (BLEU/ROUGE-L)