Retriever Validation¶
Evaluate retrieval before the LLM sees the context. Most RAG failures are retrieval failures: if the retriever hands the model irrelevant, incomplete, or poorly-ranked chunks, no amount of prompt engineering will save the answer. CheckLLM ships framework-native retriever wrappers so you can score retrieval quality in-pipeline or as a regression gate in CI.
Why evaluate at the retriever¶
Classic LLM-output evaluation (faithfulness, hallucination, toxicity) catches symptoms. Retriever evaluation catches causes:
- A low context precision score says the top-ranked chunks are not the most relevant ones. Your reranker is broken.
- A low context recall score says the retrieved set does not contain the evidence needed to answer. Your index coverage is too narrow.
- A low context relevance score says too much of what you retrieved is noise. Your chunk size or embeddings are off.
Fixing these at the retriever costs nothing in tokens. Letting them reach the LLM costs every downstream call.
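To make these failure modes concrete, here is a toy word-overlap sketch of context precision and recall. It is purely illustrative (the function names and the 0.5 relevance threshold are ours, not CheckLLM's), but it shows why a badly ranked retrieval can score low on precision while still getting full marks on recall:

```python
# Toy word-overlap illustrations of the precision/recall metric families.
# These are NOT CheckLLM's implementations -- just intuition builders.

def overlap(chunk: str, reference: str) -> float:
    """Fraction of reference words that appear in the chunk."""
    ref_words = set(reference.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(ref_words & chunk_words) / len(ref_words) if ref_words else 0.0

def toy_context_recall(chunks: list[str], reference: str) -> float:
    """Did the retrieved set, taken together, cover the evidence?"""
    return overlap(" ".join(chunks), reference)

def toy_context_precision(chunks: list[str], reference: str, threshold: float = 0.5) -> float:
    """Rank-weighted: relevant chunks near the top score higher."""
    hits = [1 if overlap(c, reference) >= threshold else 0 for c in chunks]
    if not any(hits):
        return 0.0
    # precision@k averaged over the ranks where a relevant chunk appears
    precisions = [sum(hits[: k + 1]) / (k + 1) for k, h in enumerate(hits) if h]
    return sum(precisions) / len(precisions)

reference = "Python is a high-level programming language"
good = ["Python is a high-level programming language", "Linux is a kernel"]
bad = ["Linux is a kernel", "Python is a high-level programming language"]
print(toy_context_precision(good, reference))  # relevant chunk first -> 1.0
print(toy_context_precision(bad, reference))   # relevant chunk second -> 0.5
print(toy_context_recall(bad, reference))      # evidence still present -> 1.0
```

The same retrieved set scores 1.0 or 0.5 on precision depending on ordering alone, which is exactly the reranker signal described above.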
LangChain quickstart¶
Wrap any LangChain `BaseRetriever` so every call is scored inline:
```python
from checkllm.integrations.langchain_retriever import CheckllmRetrieverWrapper
from checkllm.testing import MockJudge  # or a real JudgeBackend

judge = MockJudge(default_score=0.9)  # swap for OpenAIJudge() in prod

wrapped = CheckllmRetrieverWrapper(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    metrics=["context_precision", "context_recall", "context_relevance"],
    judge=judge,
    expected_context="Python is a high-level programming language.",
)

docs = wrapped.invoke("What is Python?")
print(wrapped.summary())
# {'context_precision': {'mean': 0.91, 'median': 0.92, 'pass_rate': 1.0, 'count': 1.0}, ...}
```
The wrapper is a drop-in replacement: documents are returned unchanged, so your existing chain sees the same `List[Document]` it always did. The scored records accumulate in `wrapped.results: list[RetrievalEvalResult]` and can be summarised or streamed to your observability stack via the `on_retrieval` callback:
```python
def stream_to_datadog(docs, metrics):
    for name, cr in metrics.items():
        statsd.gauge(f"retriever.{name}.score", cr.score)

wrapped = CheckllmRetrieverWrapper(
    retriever=base,
    on_retrieval=stream_to_datadog,
    metrics=["context_relevance"],
    judge=judge,
)
```
LlamaIndex quickstart¶
The LlamaIndex wrapper mirrors the LangChain API and plugs into `QueryEngine.from_args(retriever=...)`:
```python
from llama_index.core import VectorStoreIndex
from checkllm.integrations.llamaindex_retriever import CheckllmRetrieverWrapper

index = VectorStoreIndex.from_documents(docs)
base = index.as_retriever(similarity_top_k=5)

wrapped = CheckllmRetrieverWrapper(
    retriever=base,
    metrics=["context_precision", "context_relevance"],
    judge=judge,
)

query_engine = index.as_query_engine(retriever=wrapped)
response = query_engine.query("What is Python?")
```
`wrapped.results` and `wrapped.summary()` work identically to the LangChain wrapper. Text is extracted from each `NodeWithScore` via `node.node.text`, with a fallback to `get_content()`.
Composable vs one-shot evaluation¶
There are two usage modes:
- Composable (wrapper) — wrap the retriever once, let it evaluate every live retrieval. Use this in development, staging, or low-traffic production to surface regressions as they happen.
- One-shot (`evaluate_retriever`) — run a fixed dataset of `Case(input=..., context=ground_truth)` through the retriever and return scored results. Use this in CI as a regression gate.
```python
from checkllm.datasets.case import Case
from checkllm.integrations.langchain_retriever import evaluate_retriever

cases = [
    Case(input="What is Python?", context="Python is a high-level language."),
    Case(input="Who created Linux?", context="Linus Torvalds created Linux."),
]

records = await evaluate_retriever(
    retriever=vectorstore.as_retriever(),
    cases=cases,
    metrics=["context_precision", "context_recall"],
    judge=judge,
)

overall = sum(r.pass_rate for r in records) / len(records)
assert overall >= 0.9, f"Retriever regressed: {overall:.2f}"
```
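One-shot runs can gate on judge spend as well as quality. A hedged sketch using only the `pass_rate` and `total_cost` fields this page documents; the `_Record` stand-in and the `suite_report` helper are ours, not part of the CheckLLM API:

```python
from dataclasses import dataclass

# Stand-in for RetrievalEvalResult, carrying only the two fields used below.
@dataclass
class _Record:
    pass_rate: float
    total_cost: float

def suite_report(records, budget_usd: float = 0.50):
    """Aggregate a one-shot run: overall pass rate plus a judge-cost budget gate."""
    overall = sum(r.pass_rate for r in records) / len(records)
    spent = sum(r.total_cost for r in records)
    return overall, spent, spent <= budget_usd

records = [_Record(pass_rate=1.0, total_cost=0.01), _Record(pass_rate=0.5, total_cost=0.02)]
overall, spent, within_budget = suite_report(records)
print(f"{overall:.2f} {spent:.2f} {within_budget}")  # 0.75 0.03 True
```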
Interpreting retrieval metrics¶
| Metric | What it answers | Needs judge | Needs ground truth |
|---|---|---|---|
| `context_precision` | Are relevant chunks ranked first? | yes | yes (expected) |
| `context_recall` | Did we retrieve enough to answer? | yes | yes (expected) |
| `context_relevance` | How much of the retrieved text is on-topic? | yes | no |
| `nonllm_context_precision` | Same as precision, but via word overlap (zero cost) | no | yes |
| `nonllm_context_recall` | Same as recall, but via word overlap (zero cost) | no | yes |
| `noise_sensitivity` | Does irrelevant context change the output? | yes | no |
Start with the `nonllm_*` variants for quick CI gates, then add the LLM-based variants once you have a judge budget.
CI integration example¶
```yaml
# .github/workflows/retriever-regression.yml
- name: Retriever regression
  run: |
    python -m pytest tests/retriever_eval/ -q
```

```python
# tests/retriever_eval/test_retriever.py
import asyncio

from checkllm.datasets.case import Case
from checkllm.integrations.langchain_retriever import evaluate_retriever
from myapp.retrieval import build_retriever

FIXTURES = [
    Case(input="What is Python?", context="Python is a high-level language."),
    Case(input="Who created Linux?", context="Linus Torvalds created Linux."),
]

def test_retriever_meets_baseline():
    retriever = build_retriever()
    records = asyncio.run(
        evaluate_retriever(
            retriever,
            cases=FIXTURES,
            metrics=["nonllm_context_precision", "nonllm_context_recall"],
        )
    )
    pass_rate = sum(r.pass_rate for r in records) / len(records)
    assert pass_rate >= 0.8, f"Retriever pass rate {pass_rate:.2f} below baseline"
```
Comparison vs Ragas¶
Ragas pioneered retriever-level evaluation but couples it tightly to its own runner. CheckLLM's retriever wrappers fit into whichever orchestrator you already use:
- Framework-native: `CheckllmRetrieverWrapper` subclasses LangChain's `BaseRetriever` and LlamaIndex's `BaseRetriever`, so it works everywhere those types work -- `RetrievalQA`, `QueryEngine`, LCEL pipelines, agent tools -- without code changes.
- Cost-aware: every `RetrievalEvalResult` carries `total_cost`, so you can enforce per-query or per-suite judge budgets with the existing CheckLLM budget primitives.
- Non-LLM path: the `nonllm_context_*` metrics are a zero-cost CI gate equivalent to Ragas's lexical baselines, but first-class in the API.
- Same CheckResult: retriever metrics emit the same `CheckResult` your other guardrails and output metrics emit, so dashboards, Prometheus exporters, and Langfuse traces stay unified.
If you are migrating from Ragas, most pipelines only need to rename `context_precision` / `context_recall` / `context_relevance` and replace the Ragas runner with `evaluate_retriever`.