| pytest native | Yes | Yes | No | No |
| Free deterministic checks | 39 (composable via `@check` + `AllOf`/`AnyOf`/`Not`) | Limited | No | Yes |
| LLM-as-judge metrics | 72+ | 14+ | 8+ | Custom |
| Retrieval ranking metrics | NDCG, MRR, MAP@k, P@k, R@k, HitRate@k | Partial | Partial | No |
| Agent trajectory eval | Tool-param + tool-selection + trajectory (order / loops / coverage) | Tool-usage metric | No | No |
| Red-team / safety | 151 vuln types, 25 strategies, OWASP Top-10 LLM scorecard, ExploitSuccessRate | Built-in red-team | No | Plugin |
| Multi-provider judges | 9 backends incl. native Vertex AI | OpenAI-focused | OpenAI-focused | Multiple |
| Consensus judging | 7 strategies | No | No | No |
| Per-provider rate limiting | Dual RPM + TPM buckets, 429 / `Retry-After` aware retry | No | No | No |
| Batch API (cost savings) | OpenAI + Anthropic (50% discount auto-applied) | No | No | No |
| Distributed tracing | W3C `traceparent` propagation + Langfuse / LangSmith / Datadog / Prometheus | Partial | No | No |
| Production guardrails | Built-in | No | No | No |
| Cost estimation + attribution | Per-metric, per-test, per-provider rollups + `/api/cost/*` endpoints | No | No | No |
| Auto-detect judge | Yes | No | No | No |
| Live progress dashboard | `/live` page + `/ws/progress` WebSocket | No | No | No |
| Vector-store integrations | Pinecone, Weaviate, Milvus, Chroma (+ KB faithfulness, freshness audit) | No | No | No |
| Judge drift detection | Canonical probe baselines + CLI `checkllm drift` | No | No | No |
| Experiment analysis | Pearson / Spearman + Welch's t + Mann-Whitney U + Cohen's d + bootstrap CI | No | No | A/B only |
| Benchmarks shipped | 21 (MMLU, TruthfulQA, GSM8K, HumanEval, SQuAD 2.0, ARC, BBH, DROP, CNN/DM, …) | Several | Few | Via custom |
| Fluent chaining | `check.that()` | No | No | No |
| Plugin system | Entry points | No | No | Custom |
| Runtime overhead | Zero (plugin) | Framework | Framework | CLI |
| Language | Python | Python | Python | YAML + JS |
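
For readers unfamiliar with the retrieval ranking metrics named in the table, the sketch below shows their standard textbook definitions in plain Python. It is illustrative only and does not use or reproduce this library's API; the function names (`precision_at_k`, `recall_at_k`, `hit_rate_at_k`, `reciprocal_rank`, `ndcg_at_k`) and the document IDs are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def precision_at_k(relevant: Set[str], ranked: Sequence[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(relevant: Set[str], ranked: Sequence[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k (R@k)."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def hit_rate_at_k(relevant: Set[str], ranked: Sequence[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(relevant: Set[str], ranked: Sequence[str]) -> float:
    """1 / rank of the first relevant result; averaging over queries gives MRR."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(gains: Dict[str, float], ranked: Sequence[str], k: int) -> float:
    """DCG of the top-k divided by the DCG of an ideal ordering (NDCG@k)."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1.0, "d2": 1.0}

print(precision_at_k(relevant, ranked, k=2))
print(recall_at_k(relevant, ranked, k=2))
print(hit_rate_at_k(relevant, ranked, k=1))
print(reciprocal_rank(relevant, ranked))
print(ndcg_at_k(gains, ranked, k=4))
```

Set-based metrics (P@k, R@k, HitRate@k) only ask whether relevant documents appear in the top-k, while MRR and NDCG also reward placing them earlier in the ranking.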