CheckLLM Competitor Benchmark Results

framework  dataset     metric_family      auc    best_f1  spearman  n    mean_latency_ms  total_cost_usd  rank
checkllm   halubench   hallucination      0.783  0.796    0.544     200  2415             0.0343          1
promptfoo  halubench   hallucination      0.753  0.791    0.510     200  1802             0.0292          2
deepeval   halubench   hallucination      0.553  0.701    0.151     200  4457             0.0000          3
checkllm   ragtruth    context_relevance  0.565  0.856    0.125     200  2351             0.0623          1
promptfoo  ragtruth    context_relevance  0.500  0.854    nan       200  1364             0.0423          2
deepeval   ragtruth    context_relevance  0.435  0.854    -0.100    200  20572            0.0000          3
checkllm   ragtruth    faithfulness       0.754  0.861    0.424     200  11878            0.0613          1
deepeval   ragtruth    faithfulness       0.631  0.854    0.205     200  17191            0.0000          2
promptfoo  ragtruth    faithfulness       0.534  0.856    0.090     200  1693             0.0441          3
checkllm   ragtruth    hallucination      0.663  0.871    0.398     200  2728             0.0442          1
deepeval   ragtruth    hallucination      0.588  0.869    0.311     200  3669             0.0000          2
promptfoo  ragtruth    hallucination      0.513  0.855    0.081     200  1602             0.0441          3
checkllm   truthfulqa  answer_relevancy   0.546  0.667    0.085     400  6643             0.0213          1
deepeval   truthfulqa  answer_relevancy   0.438  0.667    -0.122    400  30596            0.0000          2
promptfoo  truthfulqa  answer_relevancy   0.392  0.667    -0.233    400  1176             0.0247          3
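For readers reproducing the table, best_f1 is the F1 score maximized over judge-score decision thresholds (an assumption about the harness; the exact sweep is not shown in this report). A minimal pure-Python sketch:

```python
def best_f1(scores, labels):
    """Sweep every observed score as a decision threshold and return the
    maximum F1. A sample is predicted positive when score >= threshold."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 undefined/zero with no true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

This is also why best_f1 is near-identical across frameworks on ragtruth context_relevance and truthfulqa: on heavily imbalanced or degenerate score distributions, a single permissive threshold dominates the sweep regardless of ranking quality, which is what auc and spearman capture instead.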

Notes

  • Judge model: gpt-4o-mini, run with 8-way concurrency and per-command --budget-usd 5.0 caps.
  • The DeepEval cost column reports $0.00 because the DeepEval adapter does not expose token usage through its metric API; the actual API spend is roughly proportional to CheckLLM's reported cost for the same metric family.
  • Ragas is omitted. Importing ragas pulls in torch, which hangs on Windows in this environment, so Ragas rows are absent from the current publish. Unit tests cover the Ragas adapter offline.
  • JailbreakBench is omitted from this run (Scenario A): the jailbreak_resistance family is only supported by promptfoo today, and the JBB-Behaviors dataset ships only harmful goals with no LLM-under-test answers, so a meaningful comparison requires generating target-model responses before grading. Tracked in docs/benchmarks/enhancements/remaining-gaps.md.
  • TruthfulQA is scored as a balanced binary task. Each source row emits a best_answer sample (label 1.0) and an incorrect_answers[0] sample (label 0.0), so ROC-AUC is well-defined. --limit 200 yields 400 graded samples per framework.
  • RAGTruth context_relevance is scored answer-aware for CheckLLM. The retrieved context alone does not carry a retrieval-relevance label, so CheckLLM folds the system answer into the judge prompt and grades whether the context precisely justifies that answer. DeepEval and promptfoo keep their original context-only semantics.
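The TruthfulQA expansion described in the notes can be sketched as follows. The field names best_answer and incorrect_answers follow the note; the helper and the rank-based AUC are illustrative, not the harness's actual code:

```python
def expand_truthfulqa(rows):
    """Turn each TruthfulQA source row into one positive and one negative
    sample, per the notes: best_answer -> label 1.0, the first incorrect
    answer -> label 0.0. This keeps the task exactly class-balanced."""
    samples = []
    for row in rows:
        samples.append({"question": row["question"],
                        "answer": row["best_answer"], "label": 1.0})
        samples.append({"question": row["question"],
                        "answer": row["incorrect_answers"][0], "label": 0.0})
    return samples

def roc_auc(scores, labels):
    """Rank-based ROC-AUC: the probability that a randomly chosen positive
    outscores a randomly chosen negative, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1.0]
    neg = [s for s, y in zip(scores, labels) if y == 0.0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With --limit 200, the expansion yields 400 graded samples (200 positives, 200 negatives), so chance-level AUC is 0.5 and both classes are guaranteed non-empty.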
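The answer-aware grading described in the last note amounts to folding the system answer into the judge prompt. The sketch below pictures that assembly; the prompt wording and function name are illustrative, not CheckLLM's actual judge template:

```python
def build_context_relevance_prompt(context: str, answer: str) -> str:
    """Answer-aware context_relevance: include the system answer so the
    judge grades whether the retrieved context justifies that specific
    answer, rather than grading the context in isolation (the context-only
    semantics that DeepEval and promptfoo keep)."""
    return (
        "You are grading retrieval quality.\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"System answer:\n{answer}\n\n"
        "Does the context precisely justify the answer? "
        "Reply with a relevance score from 0.0 to 1.0."
    )
```

The design trade-off: answer-aware grading gives the judge a concrete target to verify against, which is why it can recover a signal (spearman 0.125) where the context-only variant is at or below chance, but it measures context-answer support rather than pure retrieval relevance.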