v5.1.0 New APIs
This page covers the public APIs added in v5.1.0. For the full release notes see CHANGELOG.md.
Retrieval ranking metrics
Classical IR ranking metrics for RAG retrieval evaluation. All are pure Python, deterministic, and require no judge call.
from checkllm.metrics import (
    NDCG, MRR, MAPAtK, PrecisionAtK, RecallAtK, HitRateAtK,
)
ndcg = NDCG(k=10)
await ndcg.evaluate(retrieved=["doc1", "doc7", "doc3"], relevant={"doc1", "doc3"})
Each returns a score in [0, 1].
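As a rough illustration of what these metrics measure, the following is a minimal pure-Python sketch using binary relevance. The helper names (`precision_at_k`, `recall_at_k`, `reciprocal_rank`, `ndcg_at_k`) are hypothetical and are not part of checkllm; the library's own gain and discount conventions may differ.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["doc1", "doc7", "doc3"]
relevant = {"doc1", "doc3"}

p3 = precision_at_k(retrieved, relevant, 3)   # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, 3)      # both relevant docs were found
rr = reciprocal_rank(retrieved, relevant)     # first hit is at rank 1
ndcg10 = ndcg_at_k(retrieved, relevant, 10)   # penalized: doc3 is at rank 3, not 2
```

Every function stays within [0, 1] because each divides an observed count or gain by its maximum attainable value.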
Native Vertex AI judge
GCP enterprises can now use Vertex AI without the LiteLLM shim.
from checkllm.providers import VertexAIJudge, create_judge
judge = VertexAIJudge(
    model="gemini-2.0-flash",
    project="my-gcp-project",  # or $GOOGLE_CLOUD_PROJECT
    location="us-central1",  # or $GOOGLE_CLOUD_LOCATION
)
# or:
judge = create_judge("vertex", model="gemini-2.0-flash", project="my-gcp-project")
Install with pip install checkllm[vertex]. The judge uses Application Default
Credentials unless credentials= is passed explicitly.
Per-provider rate limiting
Each provider gets two token buckets, one for requests per minute (RPM) and one for tokens per minute (TPM). Retries honor HTTP 429 responses and the Retry-After header.
from checkllm import (
    ProviderRateLimiter, RateLimit, RetryConfig,
    TokenBucket, retry_with_backoff,
)
limiter = ProviderRateLimiter(limits={
    "openai": RateLimit(rpm=500, tpm=30_000),
    "anthropic": RateLimit(rpm=50, tpm=20_000),
})
await limiter.acquire("openai", est_tokens=1500)
# ... make call, then:
limiter.release_actual("openai", actual_tokens=1420)
AsyncEngine.submit_judge(...) wraps this automatically. Configure via
CheckllmConfig.rate_limits.
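To make the acquire-then-release pattern concrete, here is a minimal synchronous sketch of a dual RPM + TPM limiter. `SimpleTokenBucket` and `DualLimiter` are hypothetical stand-ins, not checkllm's `TokenBucket` or `ProviderRateLimiter`, and the injectable clock exists only so the example is deterministic.

```python
import time

class SimpleTokenBucket:
    """Bucket holding up to `capacity` units, refilled evenly over 60 seconds."""

    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_sec = capacity / 60.0
        self.tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def _refill(self):
        now = self._clock()
        gained = (now - self._last) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + gained)
        self._last = now

    def try_take(self, amount):
        self._refill()
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False

    def refund(self, amount):
        self._refill()
        self.tokens = min(self.capacity, self.tokens + amount)

class DualLimiter:
    """A call must fit in both the request bucket and the token bucket."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.requests = SimpleTokenBucket(rpm, clock)
        self.tokens = SimpleTokenBucket(tpm, clock)

    def try_acquire(self, est_tokens):
        if not self.requests.try_take(1):
            return False
        if not self.tokens.try_take(est_tokens):
            self.requests.refund(1)  # undo the request slot we just took
            return False
        return True

    def release_actual(self, est_tokens, actual_tokens):
        # Return the unused part of the estimate to the token bucket.
        if actual_tokens < est_tokens:
            self.tokens.refund(est_tokens - actual_tokens)

now = [0.0]  # fake clock, advanced manually below
limiter = DualLimiter(rpm=2, tpm=3000, clock=lambda: now[0])

first = limiter.try_acquire(1500)
limiter.release_actual(1500, 1420)   # 80 unused tokens go back
second = limiter.try_acquire(1500)
third = limiter.try_acquire(10)      # request bucket is empty
now[0] += 60.0                       # a full minute later both buckets refill
fourth = limiter.try_acquire(1500)
```

Reserving an estimate up front and refunding the difference afterwards keeps the token bucket from being overdrawn while a call is still in flight.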
W3C trace-context propagation
propagate_trace_context() injects traceparent / tracestate headers
into every outbound judge HTTP call, so a single evaluation appears as
one trace across service boundaries.
from checkllm import propagate_trace_context
with tracer.span("rag-eval"):
    # Every judge call in this block carries the active traceparent.
    await check.that(output).is_faithful_to(context)
Install with pip install checkllm[otel].
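For readers unfamiliar with the header itself, the sketch below builds and parses a W3C `traceparent` value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names are hypothetical and this is not how checkllm implements propagation; it only shows why calls that share a trace ID are stitched into one trace.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def make_traceparent(trace_id=None, sampled=True):
    """Build a version-00 traceparent with a fresh span ID."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header)
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def child_headers(incoming):
    """Headers for an outbound call that continues the incoming trace."""
    parent = parse_traceparent(incoming)
    if parent is None:
        return {"traceparent": make_traceparent()}  # start a new trace
    return {
        "traceparent": make_traceparent(parent["trace_id"], parent["sampled"])
    }

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outbound = child_headers(incoming)
parsed = parse_traceparent(outbound["traceparent"])
```

The outbound header keeps the incoming trace ID and sampling flag but carries a new span ID, which is what lets a tracing backend nest the judge call under the caller's span.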
Anthropic streaming + batch API
from checkllm.streaming import StreamingEvaluator
from checkllm import get_batch_runner, AnthropicBatchRunner
evaluator = StreamingEvaluator(metrics=[...])
async for chunk in evaluator.evaluate_provider("anthropic", prompt, model="claude-sonnet-4-5"):
    ...
runner = get_batch_runner("anthropic", model="claude-sonnet-4-5")
job = await runner.submit(requests)
job = await runner.poll(job)
responses = await runner.retrieve(job) # 50% batch discount auto-applied
CLI: checkllm batch --batch anthropic --dataset eval.yaml.
Cost attribution rollups
Every CheckResult now carries a CostBreakdown. New dashboard
endpoints aggregate across providers, metrics, and tests.
GET /api/cost/by-provider
GET /api/cost/by-metric
GET /api/cost/by-test
GET /api/cost/timeseries?bucket=hour|day
Pricing table lives in checkllm.pricing with a 2026-04 snapshot for
OpenAI, Anthropic, Gemini, DeepSeek, and Bedrock variants.
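The rollups these endpoints describe amount to grouping per-call costs by a key or by a time bucket. The sketch below shows that aggregation over plain dictionaries; the record fields (`provider`, `metric`, `ts`, `cost_usd`) and the sample amounts are invented for illustration and do not reflect the shape of `CostBreakdown` or the endpoint responses.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical per-call cost records; the amounts are made up.
records = [
    {"provider": "openai", "metric": "faithfulness",
     "ts": "2026-04-01T10:05:00+00:00", "cost_usd": 0.012},
    {"provider": "anthropic", "metric": "faithfulness",
     "ts": "2026-04-01T10:40:00+00:00", "cost_usd": 0.020},
    {"provider": "openai", "metric": "relevance",
     "ts": "2026-04-01T11:15:00+00:00", "cost_usd": 0.008},
]

def rollup(rows, key):
    """Total cost grouped by one field, e.g. 'provider' or 'metric'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost_usd"]
    return dict(totals)

def timeseries(rows, bucket="hour"):
    """Total cost per UTC hour or day, sorted chronologically."""
    if bucket not in ("hour", "day"):
        raise ValueError("bucket must be 'hour' or 'day'")
    fmt = "%Y-%m-%dT%H:00" if bucket == "hour" else "%Y-%m-%d"
    totals = defaultdict(float)
    for row in rows:
        ts = datetime.fromisoformat(row["ts"]).astimezone(timezone.utc)
        totals[ts.strftime(fmt)] += row["cost_usd"]
    return dict(sorted(totals.items()))

by_provider = rollup(records, "provider")
by_metric = rollup(records, "metric")
hourly = timeseries(records, bucket="hour")
daily = timeseries(records, bucket="day")
```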
Live progress dashboard
from checkllm.dashboard_ws import run
run(host="127.0.0.1", port=8485)
# Browse http://localhost:8485/live
# Or subscribe to ws://localhost:8485/ws/progress
For non-loopback deployments, pass token="<shared-secret>". The /live
page and the WebSocket upgrade then require ?token=<value>.
Check registry parity
The @check decorator mirrors @metric, and composition primitives
(AllOf, AnyOf, Not) combine registered checks by name.
from checkllm import check, AllOf, AnyOf, Not, run_check, CHECK_REGISTRY
@check("has_greeting", tags=("format",))
def has_greeting(output: str) -> CheckResult: ...
combo = AllOf("has_greeting", "max_length_2000", Not("contains_pii"))
result = combo(output="Hello world")
All 39 built-in deterministic checks are auto-registered.
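The idea behind name-based composition can be sketched with a plain dictionary registry and boolean predicates. Everything here (`REGISTRY`, `register`, `all_of`, `any_of`, `not_`, and the toy `contains_email` check) is hypothetical and much simpler than checkllm's `CHECK_REGISTRY`, which returns `CheckResult` objects rather than booleans.

```python
import re

REGISTRY = {}

def register(name):
    """Decorator that stores a predicate under a string name."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator

@register("has_greeting")
def has_greeting(output):
    return output.lower().startswith(("hello", "hi"))

@register("max_length_2000")
def max_length_2000(output):
    return len(output) <= 2000

@register("contains_email")
def contains_email(output):
    # Toy pattern for illustration, not a real PII detector.
    return re.search(r"[\w.+-]+@[\w-]+\.\w+", output) is not None

def _resolve(check):
    """Accept either a registered name or an already-built predicate."""
    return REGISTRY[check] if isinstance(check, str) else check

def all_of(*checks):
    resolved = [_resolve(c) for c in checks]
    return lambda output: all(c(output) for c in resolved)

def any_of(*checks):
    resolved = [_resolve(c) for c in checks]
    return lambda output: any(c(output) for c in resolved)

def not_(check):
    inner = _resolve(check)
    return lambda output: not inner(output)

combo = all_of("has_greeting", "max_length_2000", not_("contains_email"))
clean = combo("Hello world")
leaky = combo("Hello, mail me at a@b.co")
```

Because each combinator returns an ordinary callable, combinators nest freely, for example `any_of(combo, "max_length_2000")`.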
Agent trajectory metrics
from checkllm.metrics import (
    ToolParameterAccuracyMetric,
    ToolSelectionAccuracyMetric,
    TrajectoryMetric,
)
from checkllm.agents import ToolCallTrace, traces_from_test_case
TrajectoryMetric scores step ordering (Levenshtein distance vs
expected), loop detection, expected-tool coverage, and unexpected-tool
penalties.
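The four signals listed above can be sketched over plain lists of tool names. The functions below are hypothetical and the normalization (edit distance divided by the longer sequence) and the consecutive-repeat loop rule are assumptions; `TrajectoryMetric` may weight or combine these signals differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def ordering_score(actual, expected):
    """1.0 for identical sequences, falling toward 0.0 as edits increase."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(actual, expected) / longest

def has_loop(actual, max_repeats=2):
    """True if one tool is called more than max_repeats times in a row."""
    run = 1
    for prev, cur in zip(actual, actual[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False

def coverage(actual, expected):
    """Share of expected tools that were called at least once."""
    wanted = set(expected)
    return len(wanted & set(actual)) / len(wanted) if wanted else 1.0

def unexpected_tools(actual, expected):
    """Tools that were called but never expected, sorted for stability."""
    return sorted(set(actual) - set(expected))

expected = ["search", "fetch", "summarize"]
actual = ["search", "search", "fetch", "summarize"]

order = ordering_score(actual, expected)          # one extra call out of four
looped = has_loop(["search", "search", "search"])
covered = coverage(actual, expected)
extras = unexpected_tools(actual + ["delete_file"], expected)
```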
Red-team scorecards
from checkllm.redteam_scorecard import (
    ExploitSuccessRate,
    OWASPTop10LLMScorecard,
    SensitiveDataExposureRate,
    generate_redteam_report,
)
report = generate_redteam_report(attack_run)
print(report.as_dict())
Experiment analysis
from checkllm.analysis.correlation import pairwise_correlations
from checkllm.analysis.significance import welch_t, mann_whitney_u, bootstrap_ci
corrs = pairwise_correlations(snapshot)
sig = welch_t(run_a_scores, run_b_scores) # p, Cohen's d, 95% CI
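For reference, the quantities behind a Welch comparison can be computed with the standard library alone. The sketch below derives the t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d with a pooled standard deviation; the function names are hypothetical, the scores are invented, and the p-value is omitted because it needs a t-distribution CDF that the standard library does not provide. checkllm's `welch_t` may define its effect size or return type differently.

```python
import math
from statistics import mean, variance

def welch_t_stat(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Invented scores for two runs of the same test suite.
run_a_scores = [0.80, 0.90, 0.70, 0.85, 0.75]
run_b_scores = [0.60, 0.65, 0.70, 0.55, 0.50]

t, df = welch_t_stat(run_a_scores, run_b_scores)
d = cohens_d(run_a_scores, run_b_scores)
```

Unlike Student's t-test, Welch's version does not assume equal variances, which is why the degrees of freedom are estimated from the data instead of being fixed at `na + nb - 2`.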
Vector-store integrations
from checkllm.integrations.pinecone import connect as pinecone_connect
from checkllm.integrations.weaviate import connect as weaviate_connect
from checkllm.integrations.milvus import connect as milvus_connect
from checkllm.integrations.chroma import connect as chroma_connect
from checkllm.metrics import KBFaithfulnessMetric
from checkllm.audits.vectordb_freshness import FreshnessAudit
Install with pip install checkllm[vectorstores].
Judge drift detection
from checkllm.drift import JudgeBaseline, record_baseline_sync, detect_drift_sync
baseline = record_baseline_sync(judge)
baseline.save("baseline.json")
report = detect_drift_sync(judge, JudgeBaseline.load("baseline.json"))
if report.drifted:
    ...
CLI:
checkllm drift baseline openai --model gpt-4o-mini
checkllm drift check openai --baseline ./baseline.json
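Conceptually, drift detection re-scores a fixed set of probes and compares the results with the stored baseline. The sketch below uses a mean absolute difference against a threshold; the probe IDs, scores, the 0.05 threshold, and the `detect_drift` function are all assumptions for illustration, and checkllm's actual baseline format and drift criterion are not shown here.

```python
import json
from statistics import mean

def detect_drift(baseline, current, threshold=0.05):
    """Compare per-probe scores and flag drift if the mean shift is large."""
    shared = sorted(set(baseline) & set(current))
    if not shared:
        raise ValueError("baseline and current share no probes")
    deltas = {probe: current[probe] - baseline[probe] for probe in shared}
    mean_abs_delta = mean(abs(delta) for delta in deltas.values())
    return {
        "drifted": mean_abs_delta > threshold,
        "mean_abs_delta": mean_abs_delta,
        "deltas": deltas,
    }

# Round-trip through JSON to mimic saving and loading a baseline file.
saved = json.dumps({"probe-1": 0.90, "probe-2": 0.40, "probe-3": 0.75})
baseline = json.loads(saved)

stable_scores = {"probe-1": 0.91, "probe-2": 0.39, "probe-3": 0.75}
shifted_scores = {"probe-1": 0.70, "probe-2": 0.55, "probe-3": 0.60}

stable_report = detect_drift(baseline, stable_scores)
shifted_report = detect_drift(baseline, shifted_scores)
```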
Additional benchmarks
New loaders in checkllm.benchmarks:
- squad_v2: open-domain QA with abstention
- arc_challenge: multi-step reasoning
- bbh_hard: BIG-Bench Hard subsets
- drop_reading: reading comprehension
- cnn_dailymail: summarization (BLEU/ROUGE-L)
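Since the summarization loader is scored with ROUGE-L, a short sketch of that metric may help. It computes the longest common subsequence over whitespace tokens and reports an F1 score; the tokenization, the use of F1 instead of a recall-weighted F-measure, and the function names are assumptions, not checkllm's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

score = rouge_l_f1(candidate, reference)      # LCS is "the cat on the mat"
identical = rouge_l_f1(reference, reference)
disjoint = rouge_l_f1("completely different words", reference)
```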