v5.1.0 New APIs

This page covers the public APIs added in v5.1.0. For the full release notes see CHANGELOG.md.

Retrieval ranking metrics

Classical IR ranking metrics for RAG retrieval evaluation. All are pure Python, deterministic, and require no judge call.

from checkllm.metrics import (
    NDCG, MRR, MAPAtK, PrecisionAtK, RecallAtK, HitRateAtK,
)

ndcg = NDCG(k=10)
await ndcg.evaluate(retrieved=["doc1", "doc7", "doc3"], relevant={"doc1", "doc3"})

Each returns a score in [0, 1].
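
To make the scoring concrete, here is a minimal pure-Python sketch of NDCG@k with binary relevance. It is an independent illustration, not checkllm's implementation: the helper name `ndcg_at_k`, the binary-gain convention, and the assumption that `retrieved` contains no duplicate IDs are all mine.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary gains (hypothetical helper, not part of checkllm)."""
    top_k = list(retrieved)[:k]
    # DCG: a relevant hit at 0-based rank i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(top_k) if doc in relevant)
    # Ideal DCG: every relevant document packed at the top of the ranking.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Same inputs as the NDCG call above: hits at ranks 1 and 3, a miss at rank 2.
score = ndcg_at_k(["doc1", "doc7", "doc3"], {"doc1", "doc3"}, k=10)
print(round(score, 4))
```

With these inputs the DCG is 1 + 0.5 = 1.5 and the ideal DCG is 1 + 1/log2(3), so the score lands just under 0.92. The miss at rank 2 is what keeps it below 1.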

Native Vertex AI judge

GCP enterprises can now use Vertex AI without the LiteLLM shim.

from checkllm.providers import VertexAIJudge, create_judge

judge = VertexAIJudge(
    model="gemini-2.0-flash",
    project="my-gcp-project",   # or $GOOGLE_CLOUD_PROJECT
    location="us-central1",      # or $GOOGLE_CLOUD_LOCATION
)
# or:
judge = create_judge("vertex", model="gemini-2.0-flash", project="my-gcp-project")

Install with pip install checkllm[vertex]. The judge uses Application Default Credentials unless credentials= is passed explicitly.

Per-provider rate limiting

Each provider gets two token buckets, one for requests per minute (RPM) and one for tokens per minute (TPM). Retries honor HTTP 429 responses and the Retry-After header.

from checkllm import (
    ProviderRateLimiter, RateLimit, RetryConfig,
    TokenBucket, retry_with_backoff,
)

limiter = ProviderRateLimiter(limits={
    "openai": RateLimit(rpm=500, tpm=30_000),
    "anthropic": RateLimit(rpm=50, tpm=20_000),
})
await limiter.acquire("openai", est_tokens=1500)
# ... make call, then:
limiter.release_actual("openai", actual_tokens=1420)

AsyncEngine.submit_judge(...) wraps this automatically. Configure via CheckllmConfig.rate_limits.
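
For readers unfamiliar with the dual-bucket idea, the following is a rough, synchronous sketch of how an RPM bucket and a TPM bucket can gate the same call. It does not reproduce checkllm's `TokenBucket` or `ProviderRateLimiter`. The names `SimpleTokenBucket`, `DualLimiter`, `reconcile`, and `FakeClock` are hypothetical, and the reconcile step is only my reading of what correcting an estimate with the actual token count might look like.

```python
import time
from typing import Callable


class SimpleTokenBucket:
    """Illustrative token bucket (hypothetical; not checkllm's TokenBucket)."""

    def __init__(self, capacity: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = capacity / 60.0  # a per-minute budget refills evenly
        self.level = capacity                  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def refill(self) -> None:
        now = self._clock()
        self.level = min(self.capacity, self.level + (now - self._last) * self.refill_per_sec)
        self._last = now


class DualLimiter:
    """One request bucket (RPM) plus one token bucket (TPM) for a single provider."""

    def __init__(self, rpm: int, tpm: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.requests = SimpleTokenBucket(rpm, clock)
        self.tokens = SimpleTokenBucket(tpm, clock)

    def try_acquire(self, est_tokens: int) -> bool:
        self.requests.refill()
        self.tokens.refill()
        # Spend from both buckets only if both can cover the call.
        if self.requests.level < 1 or self.tokens.level < est_tokens:
            return False
        self.requests.level -= 1
        self.tokens.level -= est_tokens
        return True

    def reconcile(self, est_tokens: int, actual_tokens: int) -> None:
        # Give back the unused part of the estimate (or charge the overage).
        self.tokens.level = min(self.tokens.capacity,
                                self.tokens.level + est_tokens - actual_tokens)


class FakeClock:
    """Manually advanced clock so the demo is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
limiter = DualLimiter(rpm=2, tpm=3000, clock=clock)
first = limiter.try_acquire(est_tokens=1500)
second = limiter.try_acquire(est_tokens=1500)
third = limiter.try_acquire(est_tokens=100)   # both buckets are empty here
clock.now += 45.0                             # 45 s refills 1.5 requests and 2250 tokens
fourth = limiter.try_acquire(est_tokens=100)
limiter.reconcile(est_tokens=100, actual_tokens=40)
```

A production limiter would typically wait asynchronously until capacity is available instead of returning False; the boolean return here only keeps the sketch short and testable.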

W3C trace-context propagation

propagate_trace_context() injects traceparent / tracestate headers into every outbound judge HTTP call, so a single evaluation appears as one trace across service boundaries.

from checkllm import propagate_trace_context

# `tracer` is assumed to be an already-configured tracer created elsewhere.
with tracer.span("rag-eval"):
    # Every judge call in this block carries the active traceparent.
    await check.that(output).is_faithful_to(context)

Install with pip install checkllm[otel].
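
For reference, a version-00 traceparent header has the shape 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>. The sketch below shows, under that layout only, how an outbound call can keep the incoming trace id while minting a fresh span id. It is not how propagate_trace_context() is implemented; `parse_traceparent`, `build_traceparent`, and `child_headers` are hypothetical helper names.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Version 00 layout only: lowercase hex, fixed field widths.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the W3C Trace Context specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Headers for an outbound call: same trace id, new span id."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        # No valid parent: start a new trace and drop any stale tracestate.
        return {"traceparent": build_traceparent(secrets.token_hex(16), secrets.token_hex(8))}
    trace_id, _parent_span, sampled = parsed
    headers = {"traceparent": build_traceparent(trace_id, secrets.token_hex(8), sampled)}
    if "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]
    return headers


incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor=value",
}
outbound = child_headers(incoming)
```

Because every hop reuses the trace id and only the span id changes, a tracing backend can stitch the judge calls into the evaluation's trace.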

Anthropic streaming + batch API

from checkllm.streaming import StreamingEvaluator
from checkllm import get_batch_runner, AnthropicBatchRunner

evaluator = StreamingEvaluator(metrics=[...])
async for chunk in evaluator.evaluate_provider("anthropic", prompt, model="claude-sonnet-4-5"):
    ...

runner = get_batch_runner("anthropic", model="claude-sonnet-4-5")
job = await runner.submit(requests)
job = await runner.poll(job)
responses = await runner.retrieve(job)  # 50% batch discount auto-applied

CLI: checkllm batch --batch anthropic --dataset eval.yaml.

Cost attribution rollups

Every CheckResult now carries a CostBreakdown. New dashboard endpoints aggregate across providers, metrics, and tests.

GET /api/cost/by-provider
GET /api/cost/by-metric
GET /api/cost/by-test
GET /api/cost/timeseries?bucket=hour|day

The pricing table lives in checkllm.pricing and ships a 2026-04 snapshot for OpenAI, Anthropic, Gemini, DeepSeek, and Bedrock variants.

Live progress dashboard

from checkllm.dashboard_ws import run

run(host="127.0.0.1", port=8485)
# Browse http://localhost:8485/live
# Or subscribe to ws://localhost:8485/ws/progress

For non-loopback deployments, pass token="<shared-secret>". The /live page and the WebSocket upgrade then require ?token=<value>.

Check registry parity

The @check decorator mirrors @metric, and the composition primitives AllOf, AnyOf, and Not combine registered checks.

from checkllm import check, AllOf, AnyOf, Not, run_check, CHECK_REGISTRY

@check("has_greeting", tags=("format",))
def has_greeting(output: str) -> CheckResult: ...

combo = AllOf("has_greeting", "max_length_2000", Not("contains_pii"))
result = combo(output="Hello world")

All 39 built-in deterministic checks are auto-registered.

Agent trajectory metrics

from checkllm.metrics import (
    ToolParameterAccuracyMetric,
    ToolSelectionAccuracyMetric,
    TrajectoryMetric,
)
from checkllm.agents import ToolCallTrace, traces_from_test_case

TrajectoryMetric scores step ordering (Levenshtein distance vs expected), loop detection, expected-tool coverage, and unexpected-tool penalties.
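
As an illustration of the step-ordering component only, the sketch below computes Levenshtein distance over sequences of tool names and turns it into a score in [0, 1]. The normalization by the longer sequence length is my assumption; the source does not say how TrajectoryMetric normalizes or weights this term, and `levenshtein` and `ordering_score` are hypothetical helpers.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences of tool names (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def ordering_score(actual: Sequence[str], expected: Sequence[str]) -> float:
    """1.0 for identical trajectories, lower as more edits are needed (assumed scaling)."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(actual, expected) / longest


expected_calls = ["search", "fetch", "summarize"]
actual_calls = ["search", "summarize", "fetch", "summarize"]
distance = levenshtein(actual_calls, expected_calls)
score = ordering_score(actual_calls, expected_calls)
```

Here the agent made one premature summarize call, so a single deletion aligns the two trajectories: the distance is 1 and the score is 1 - 1/4 = 0.75.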

Red-team scorecards

from checkllm.redteam_scorecard import (
    ExploitSuccessRate,
    OWASPTop10LLMScorecard,
    SensitiveDataExposureRate,
    generate_redteam_report,
)

report = generate_redteam_report(attack_run)
print(report.as_dict())

Experiment analysis

from checkllm.analysis.correlation import pairwise_correlations
from checkllm.analysis.significance import welch_t, mann_whitney_u, bootstrap_ci

corrs = pairwise_correlations(snapshot)
sig = welch_t(run_a_scores, run_b_scores)   # p, Cohen's d, 95% CI
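
For intuition about what a Welch test reports, here is a standard-library sketch of the t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d with a pooled standard deviation. It is not checkllm's welch_t and does not reproduce its return type. The helper names are hypothetical, the sample scores are made up for illustration, and the choice of pooled-SD Cohen's d is my assumption.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t, degrees_of_freedom) for two samples with unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Made-up scores for two runs, purely for illustration.
scores_a = [0.8, 0.9, 0.7, 0.8]
scores_b = [0.6, 0.7, 0.5, 0.6]
t_stat, dof = welch_t_statistic(scores_a, scores_b)
effect = cohens_d(scores_a, scores_b)
```

A two-sided p-value needs the Student t distribution, which the standard library does not provide; with SciPy installed it can be computed as 2 * scipy.stats.t.sf(abs(t_stat), dof).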

Vector-store integrations

from checkllm.integrations.pinecone import connect as pinecone_connect
from checkllm.integrations.weaviate import connect as weaviate_connect
from checkllm.integrations.milvus import connect as milvus_connect
from checkllm.integrations.chroma import connect as chroma_connect
from checkllm.metrics import KBFaithfulnessMetric
from checkllm.audits.vectordb_freshness import FreshnessAudit

Install with pip install checkllm[vectorstores].

Judge drift detection

from checkllm.drift import JudgeBaseline, record_baseline_sync, detect_drift_sync

baseline = record_baseline_sync(judge)
baseline.save("baseline.json")

report = detect_drift_sync(judge, JudgeBaseline.load("baseline.json"))
if report.drifted:
    ...

CLI:

checkllm drift baseline openai --model gpt-4o-mini
checkllm drift check openai --baseline ./baseline.json

Additional benchmarks

New loaders in checkllm.benchmarks:

squad_v2 — open-domain QA with abstention
arc_challenge — multi-step reasoning
bbh_hard — BIG-Bench Hard subsets
drop_reading — reading comprehension
cnn_dailymail — summarization (BLEU/ROUGE-L)