Judge model: gpt-4o-mini, run with 8-way concurrency and a per-command
--budget-usd 5.0 cap.
The DeepEval cost column reports $0.00 because the DeepEval adapter does
not expose token usage through its metric API; the real API spend is
roughly proportional to CheckLLM's reported cost for the same family.
Ragas is omitted. Importing ragas pulls in torch, which hangs on
Windows in this environment, so the Ragas column is left empty in the
currently published results. Unit tests still cover the Ragas adapter
offline.
JailbreakBench is omitted from this run (Scenario A): the
jailbreak_resistance family is supported only by promptfoo today; the
JBB-Behaviors dataset ships no answers from the LLM under test (only
harmful goals); and a meaningful comparison requires generating
target-model responses before grading. Tracked in
docs/benchmarks/enhancements/remaining-gaps.md.
TruthfulQA is scored as a balanced binary task. Each source row
emits a best_answer sample (label 1.0) and an incorrect_answers[0]
sample (label 0.0), so ROC-AUC is well-defined. --limit 200 yields
400 graded samples per framework.
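The balanced-pair construction above can be sketched as follows. This is a hypothetical illustration, not the harness's actual code; the field names question, best_answer, and incorrect_answers follow the TruthfulQA dataset schema, and expand_row is an invented helper name.

```python
# Hypothetical sketch of the balanced binary expansion: each TruthfulQA
# source row yields one positive and one negative graded sample, so the
# label distribution is exactly 50/50 and ROC-AUC is well-defined.

def expand_row(row):
    """Emit a (label 1.0, label 0.0) sample pair from one source row."""
    return [
        {"question": row["question"], "answer": row["best_answer"], "label": 1.0},
        {"question": row["question"], "answer": row["incorrect_answers"][0], "label": 0.0},
    ]

row = {
    "question": "What happens if you swallow gum?",
    "best_answer": "It passes through your digestive system.",
    "incorrect_answers": ["It stays in your stomach for seven years."],
}
samples = expand_row(row)
# 200 source rows (--limit 200) expand to 400 graded samples.
```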
RAGTruth context_relevance is scored answer-aware for CheckLLM.
The retrieved context alone does not carry a retrieval-relevance label,
so CheckLLM folds the system answer into the judge prompt and grades
whether the context precisely justifies that answer. DeepEval and
promptfoo keep their original context-only semantics.
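The difference between the two grading modes can be sketched as prompt templates. The wording below is an assumption for illustration only; the source states just that CheckLLM folds the system answer into the judge prompt, while DeepEval and promptfoo grade the context against the question alone. build_prompt and both template names are hypothetical.

```python
# Hypothetical sketch: answer-aware vs. context-only judge prompts for
# the RAGTruth context_relevance family. Exact wording is invented.

ANSWER_AWARE_TEMPLATE = (
    "Question: {question}\n"
    "Retrieved context: {context}\n"
    "System answer: {answer}\n\n"
    "Does the retrieved context precisely justify the system answer? "
    "Reply 1 for yes, 0 for no."
)

CONTEXT_ONLY_TEMPLATE = (
    "Question: {question}\n"
    "Retrieved context: {context}\n\n"
    "Is the retrieved context relevant to the question? "
    "Reply 1 for yes, 0 for no."
)

def build_prompt(question, context, answer=None):
    """Answer-aware grading when an answer is supplied, else context-only."""
    if answer is None:
        return CONTEXT_ONLY_TEMPLATE.format(question=question, context=context)
    return ANSWER_AWARE_TEMPLATE.format(
        question=question, context=context, answer=answer
    )
```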