CheckLLM Competitor Benchmark Methodology¶
Question¶
Given identical inputs and a shared judge, which evaluation framework's metrics best match ground-truth labels on public datasets?
Frameworks compared¶
- CheckLLM (this repo, editable install)
- DeepEval (`deepeval>=3.9`)
- Ragas (`ragas>=0.4`)
- promptfoo (`promptfoo>=0.1.4`, shells out to the promptfoo CLI)
Shared judge¶
All frameworks use the same judge model (`gpt-4o-mini` for the first pass, `gpt-4o` for the published pass)
through each framework's own judge adapter. Temperature is fixed at 0.
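The shared judge settings can be pinned in one place so every adapter receives identical parameters. A minimal sketch; the `JudgeConfig` dataclass is hypothetical, but the model names and temperature come from this section:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical container pinning the judge settings shared by all adapters."""
    model: str          # "gpt-4o-mini" (first pass) or "gpt-4o" (published pass)
    temperature: float  # fixed at 0 so judge outputs are as deterministic as possible


FIRST_PASS = JudgeConfig(model="gpt-4o-mini", temperature=0.0)
PUBLISHED = JudgeConfig(model="gpt-4o", temperature=0.0)
```

Freezing the dataclass guards against a framework adapter mutating the shared settings mid-run.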
Datasets¶
| Dataset | HF path | Split | N | Ground truth |
|---|---|---|---|---|
| HaluBench | PatronusAI/HaluBench | test | 14,900 | binary (PASS/FAIL) |
| RAGTruth | wandb/RAGTruth-processed | test | 2,700 | binary (label list non-empty = hallucinated) |
| TruthfulQA | truthfulqa/truthful_qa | validation | 817 | scalar (best_answer as reference) |
| JailbreakBench | JailbreakBench/JBB-Behaviors | harmful | 100 | binary (harmful/benign) |
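Each dataset's ground truth is mapped onto the shared good-is-high convention (1 = faithful/benign, 0 = hallucinated/harmful). A sketch for the two binary hallucination datasets; the row field names (`label`, `labels`) are assumptions about each dataset's schema, not verified column names:

```python
def halubench_label(row: dict) -> int:
    """HaluBench ground truth: PASS means the answer is faithful -> 1, FAIL -> 0.
    The field name 'label' is an assumption about the HF dataset schema."""
    return 1 if row["label"] == "PASS" else 0


def ragtruth_label(row: dict) -> int:
    """RAGTruth ground truth: a non-empty span-label list marks the sample as
    hallucinated -> 0; an empty list means faithful -> 1 (per the table above)."""
    return 0 if row["labels"] else 1
```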
Metric families evaluated¶
- `hallucination` — overall grounding (HaluBench, RAGTruth)
- `faithfulness` — RAG-specific unsupported claims (RAGTruth)
- `answer_relevancy` — does the answer address the query (TruthfulQA)
- `context_relevance` — is the retrieved context on-topic (RAGTruth)
- `jailbreak_resistance` — does the target refuse harmful goals (JailbreakBench)
Score normalization¶
Every adapter emits `BenchmarkScore.score` in [0, 1], where 1.0 means
good (faithful / relevant / refused). DeepEval's hallucination metric
uses the inverse convention (1.0 = fully hallucinated), so the adapter
applies `1 - score`.
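The inversion can be isolated in a single normalization step so every downstream computation sees one convention. A minimal sketch; the function name and string keys are illustrative, not CheckLLM's actual API:

```python
def normalize(framework: str, metric: str, score: float) -> float:
    """Map a raw metric score onto the shared [0, 1] good-is-high convention.

    DeepEval's hallucination metric scores *degree of hallucination*
    (1.0 = fully hallucinated), so it is inverted; every other
    (framework, metric) pair already follows the shared convention.
    """
    if framework == "deepeval" and metric == "hallucination":
        return 1.0 - score
    return score
```

Centralizing the flip here means a framework changing its convention in a future release requires a one-line fix.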
Scoring¶
For each (framework, dataset, metric_family) tuple we compute:
- ROC-AUC — threshold-free ranking quality vs ground-truth labels
- best-F1 — F1 at the threshold that maximizes F1
- Spearman — rank correlation for scalar-label datasets
- `mean_latency_ms` — wall-clock latency per sample
- `total_cost_usd` — aggregated provider spend
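The three quality metrics can be computed from a label/score vector pair with standard tooling. A sketch assuming `scipy` and `scikit-learn` (the `score_run` helper is hypothetical; CheckLLM's own implementation may differ):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import precision_recall_curve, roc_auc_score


def score_run(labels, scores):
    """Quality metrics for one (framework, dataset, metric_family) tuple.

    `labels` is binary ground truth (1 = good), `scores` holds the
    framework's normalized [0, 1] scores for the same samples.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    auc = roc_auc_score(labels, scores)                      # threshold-free ranking quality
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best_f1 = float(f1.max())                                # F1 at the F1-maximizing threshold
    rho, _ = spearmanr(labels, scores)                       # rank correlation (scalar labels)
    return {"roc_auc": float(auc), "best_f1": best_f1, "spearman": float(rho)}
```

Sweeping thresholds via `precision_recall_curve` avoids committing to a single cutoff, which matters because the frameworks' raw score distributions are not calibrated against each other.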
Reproducibility¶
- Samples fetched with `datasets.load_dataset(...)` and cached under `~/.cache/huggingface`
- Judge responses cached via `checkllm.cache.JudgeCache` (SQLite, 7-day TTL)
- All random sampling uses `seed=42`
- Each benchmark run writes a `run_manifest.json` capturing package versions, judge model, commit SHA, and dataset revision
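A manifest writer along these lines could capture the fields listed above. This is a hedged sketch of the idea, not CheckLLM's actual schema or helper name; package versions are omitted here since the pinned frameworks may not be importable everywhere:

```python
import json
import platform
import subprocess
from datetime import datetime, timezone


def write_manifest(path: str, judge_model: str, dataset_revision: str) -> dict:
    """Hypothetical sketch of what run_manifest.json records per run."""
    try:
        sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        sha = "unknown"  # e.g. running outside a git checkout
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "judge_model": judge_model,        # e.g. "gpt-4o" for the published pass
        "seed": 42,                        # shared sampling seed
        "commit_sha": sha,
        "dataset_revision": dataset_revision,
        # package versions (deepeval, ragas, ...) would also be recorded here
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Recording the dataset revision alongside the code SHA lets a later run detect silent upstream dataset edits, not just code drift.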