Exporting results¶
checkllm stores every eval result as a structured CheckResult, and the
checkllm.reporting package turns those results into the file formats
downstream tools expect: CSV for spreadsheets, Parquet for ML pipelines,
JSONL for streaming consumers, and a single-document JSON for quick
inspection. This guide shows how to pick and use each format.
Why bulk export matters¶
Eval runs generate a lot of data. A single test suite can produce
thousands of CheckResult rows across dozens of metrics, and those
rows are where all the useful information lives: pass/fail state,
per-metric scores, thresholds, costs, latencies, reasoning strings
from judges, and the input preview that triggered the check. To do
anything useful with that data beyond the terminal summary, you need
to get it out of memory and into a file your analytics stack can load.
The checkllm.reporting.bulk_export module is the one-stop entry
point. It dispatches to the right backend based on the file suffix
(or an explicit format= argument), stamps optional metadata such as
run_id and timestamp_utc into every row, and returns an
ExportSummary describing what it wrote.
from pathlib import Path
from checkllm.reporting.bulk_export import export_results
summary = export_results(results, Path("out/run-42.parquet"))
print(summary.row_count, summary.size_bytes, summary.duration_ms)
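The dispatch rule can be pictured as a small suffix table. This is a sketch of the behavior described above, not the library's actual implementation; pick_format and SUFFIX_TO_FORMAT are hypothetical names:

```python
from pathlib import Path

# Hypothetical mapping from file suffix to export backend, mirroring
# how export_results chooses a format when none is given explicitly.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".jsonl": "jsonl",
    ".json": "json",
}

def pick_format(path, format=None):
    # An explicit format= argument always wins over the file suffix.
    if format is not None:
        return format
    return SUFFIX_TO_FORMAT[Path(path).suffix]
```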
CSV (default)¶
CSV is the safest, lowest-common-denominator format. Spreadsheets open
it, every database can import it, and pandas.read_csv handles it in
one line. checkllm writes columns in a stable order, so older tooling
keeps working when new columns are appended.
from pathlib import Path
from checkllm.reporting.csv_export import write_csv
write_csv(results, Path("out/run.csv"))
Columns: test_name, metric_name, passed, score, threshold, reasoning,
cost, latency_ms, input_preview, run_id, timestamp_utc.
The optional extra_fields= argument appends additional metadata
columns to every row, useful for joining multiple runs into a single
CSV:
write_csv(
    results,
    Path("out/run.csv"),
    extra_fields={"run_id": 42, "timestamp_utc": "2026-04-22T00:00:00Z"},
)
For very large suites, use write_csv_streaming(results_iter, path).
It accepts either the usual dict[str, list[CheckResult]] or a lazy
iterator of (test_name, CheckResult) tuples, so the full result set
never needs to be held in memory.
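The bounded-memory idea behind the streaming writer can be sketched with the standard library. The stand-in FakeCheck dataclass and stream_rows_to_csv helper below are invented for illustration (the real CheckResult carries more fields); the point is that rows are written as the iterator yields them and never materialized as a list:

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class FakeCheck:
    # Minimal stand-in for CheckResult (assumption: the real class
    # carries the full column set listed above).
    metric_name: str
    passed: bool
    score: float

def stream_rows_to_csv(results_iter, path):
    # Consume (test_name, check) tuples lazily, writing one CSV row
    # per tuple, so memory use stays flat regardless of suite size.
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["test_name", "metric_name", "passed", "score"])
        count = 0
        for test_name, check in results_iter:
            writer.writerow([test_name, check.metric_name, check.passed, check.score])
            count += 1
    return count

def produce():
    # Lazy generator: only the current tuple is ever in memory.
    for i in range(3):
        yield (f"test_{i}", FakeCheck("accuracy", True, 0.9))

out = Path(tempfile.mkdtemp()) / "run.csv"
row_count = stream_rows_to_csv(produce(), out)
```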
Parquet (ML pipelines)¶
Parquet is the best choice when results flow into a data lake, a feature store, or a notebook-based analysis. It is columnar, compressed, and ~10x smaller than CSV for the same data.
from pathlib import Path
from checkllm.reporting.parquet_export import write_parquet
write_parquet(results, Path("out/run.parquet"), compression="zstd")
pyarrow is an optional dependency. If it is not installed, the
function raises a clean ImportError with the install hint
pip install checkllm[parquet].
Read a Parquet file back into a dict[str, list[CheckResult]] for
regression comparisons:
from checkllm.reporting.parquet_export import read_parquet
baseline = read_parquet(Path("out/baseline.parquet"))
Supported compression codecs: snappy (default), gzip, brotli,
zstd, lz4, and none.
JSONL (streaming)¶
JSONL (one JSON object per line) is the right format when downstream
consumers stream records line-by-line: a log shipper, a message
queue, a serverless loader. The stream_jsonl helper writes rows
directly from a generator so memory stays bounded even for very large
result sets.
from pathlib import Path
from checkllm.reporting.bulk_export import stream_jsonl
def produce_results():
    for test_name, check in expensive_eval_iterator():
        yield (test_name, check)
summary = stream_jsonl(produce_results(), Path("out/stream.jsonl"))
print(summary.row_count)
The eager API is also available: because export_results dispatches on
the file suffix, passing a .jsonl path writes the same line-per-record
output in a single call from an in-memory result set.
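Concretely, JSONL is just one JSON document per line, so any consumer can replay the file with the standard library. A sketch using plain dicts in place of serialized CheckResult rows (the sample data is invented for illustration):

```python
import json
import tempfile
from pathlib import Path

# Plain dicts standing in for serialized CheckResult rows.
rows = [
    {"test_name": "greeting", "metric_name": "tone", "passed": True, "score": 0.91},
    {"test_name": "greeting", "metric_name": "length", "passed": False, "score": 0.40},
]

path = Path(tempfile.mkdtemp()) / "stream.jsonl"

# Write: one compact JSON document per line.
with open(path, "w") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

# Read back line-by-line, the way a streaming consumer would.
replayed = [json.loads(line) for line in path.read_text().splitlines()]
```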
JSON (single-file structured)¶
For configuration-style consumers that want one file they can load
whole, use the single-document JSON format. The payload is
{"results": {test_name: [check_dict, ...]}} with an optional
metadata block.
from pathlib import Path
from checkllm.reporting.bulk_export import export_results
export_results(
    results,
    Path("out/run.json"),
    extra_fields={"run_id": 42, "git_commit": "abc123"},
)
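Consumers can load the single-document payload with the standard library alone. A sketch against the documented {"results": ...} shape, with a miniature payload invented for illustration:

```python
import json

# A miniature payload in the documented shape:
# {"results": {test_name: [check_dict, ...]}} plus an optional metadata block.
payload_text = json.dumps({
    "results": {
        "greeting": [{"metric_name": "tone", "passed": True, "score": 0.91}],
        "refusal": [{"metric_name": "safety", "passed": True, "score": 1.0}],
    },
    "metadata": {"run_id": 42, "git_commit": "abc123"},
})

payload = json.loads(payload_text)

# Flatten to (test_name, check_dict) pairs for downstream iteration.
pairs = [
    (test_name, check)
    for test_name, checks in payload["results"].items()
    for check in checks
]
```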
Comparing two runs in the dashboard¶
The dashboard ships a side-by-side comparison view for any two runs
stored in RunHistory. Start the dashboard and open the comparison
page for the two run IDs you want to diff.
The HTML page shows a summary panel (metrics compared, improved,
regressed, unchanged, average delta) and a per-metric delta table.
Rows are color-coded: green for improvements, red for regressions,
neutral for unchanged metrics. Direction is shown as +, -, or
=.
Under the hood the page is backed by
build_comparison(history, a, b), which returns a ComparisonView
suitable for programmatic use:
from checkllm.dashboard import build_comparison
from checkllm.history import RunHistory
history = RunHistory()
view = build_comparison(history, snapshot_a_id=41, snapshot_b_id=42)
print(view.improved)
print(view.regressed)
print(view.metrics_diff)
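The improved/regressed/unchanged buckets follow directly from the sign of each per-metric score delta. A hypothetical classify_deltas helper (not the dashboard's internals) makes the rule explicit:

```python
def classify_deltas(a_scores, b_scores):
    # Bucket each metric shared by both runs according to the sign of
    # (b - a), mirroring the dashboard's +, -, = direction markers.
    improved, regressed, unchanged = {}, {}, {}
    for metric in sorted(a_scores.keys() & b_scores.keys()):
        delta = b_scores[metric] - a_scores[metric]
        if delta > 0:
            improved[metric] = delta
        elif delta < 0:
            regressed[metric] = delta
        else:
            unchanged[metric] = delta
    return improved, regressed, unchanged

improved, regressed, unchanged = classify_deltas(
    {"tone": 0.80, "safety": 1.0, "length": 0.70},
    {"tone": 0.85, "safety": 1.0, "length": 0.60},
)
```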
The JSON endpoint is available at POST /api/compare with either
query parameters (?a=<id>&b=<id>) or a JSON body
({"a": <id>, "b": <id>}). The response is a serialized
ComparisonView plus a summary block.
Integrating with pandas / polars / duckdb¶
The tabular exports (CSV, Parquet, JSONL) load into the major analytics libraries in a single line; the single-document JSON loads with json.load.
pandas:
import pandas as pd
df = pd.read_parquet("out/run.parquet")
df = pd.read_csv("out/run.csv")
df = pd.read_json("out/run.jsonl", lines=True)
polars (fastest for Parquet):
import polars as pl
df = pl.read_parquet("out/run.parquet")
df = pl.read_csv("out/run.csv")
df = pl.read_ndjson("out/run.jsonl")
duckdb (no load, queries directly against the file):
import duckdb
duckdb.sql("SELECT metric_name, AVG(score) FROM 'out/run.parquet' GROUP BY 1")
duckdb.sql("SELECT * FROM read_csv_auto('out/run.csv') WHERE passed = false")
duckdb.sql("SELECT * FROM read_json_auto('out/run.jsonl') WHERE score < 0.5")
CI regression gates¶
Export on every CI run with a deterministic filename, then diff the most recent against the previous artifact to fail the build on regressions.
- name: Run evals
  run: checkllm run --export out/run-${{ github.sha }}.parquet

- name: Compare against baseline
  run: |
    python - <<'PY'
    from checkllm.reporting.parquet_export import read_parquet
    from checkllm.dashboard import build_comparison_view

    new = read_parquet("out/run-${{ github.sha }}.parquet")
    base = read_parquet("out/baseline.parquet")
    view = build_comparison_view(base, new, label_a="baseline", label_b="pr")

    regressions = view.regressed
    if regressions:
        print("Regressed metrics:", regressions)
        raise SystemExit(1)
    PY
Because read_parquet is a pure Python function returning the same
dict[str, list[CheckResult]] shape as a live run, the same
comparison utilities work equally well against in-memory results and
stored artifacts. Keep the most recent "known good" Parquet file in
the repo or object storage and every PR becomes a zero-extra-work
regression test.