Deployment Guide¶
This guide covers deploying checkllm in production environments, including Docker containers, multi-GPU judge scaling, and operational hardening.
Prerequisites¶
- Python 3.10+ installed
- At least one API key (OpenAI, Anthropic, or Gemini) or a local model server (Ollama, vLLM)
- Docker 24.0+ (for containerised deployments)
- 2 GB RAM minimum; 8 GB+ recommended when running local judge models
Docker: Single Container¶
Dockerfile¶
Create a Dockerfile in your project root:
FROM python:3.12-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential curl \
&& rm -rf /var/lib/apt/lists/*
COPY pyproject.toml requirements.lock ./
RUN pip install --no-cache-dir -r requirements.lock && \
pip install --no-cache-dir "checkllm[all]"
COPY . .
CMD ["pytest", "tests/", "-v"]
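Because `COPY . .` pulls in the entire build context, a `.dockerignore` keeps VCS history, virtualenvs, and local secrets out of the image. A minimal sketch (adjust entries to your project layout):

```
.git
.venv
__pycache__/
*.pyc
.pytest_cache/
.env
```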
Build and run¶
docker build -t myapp-llm-tests .
docker run \
-e OPENAI_API_KEY="$OPENAI_API_KEY" \
-e CHECKLLM_BUDGET=10.0 \
myapp-llm-tests
Environment variables¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | one required | — | OpenAI judge key |
| `ANTHROPIC_API_KEY` | one required | — | Anthropic judge key |
| `GEMINI_API_KEY` | one required | — | Gemini judge key |
| `CHECKLLM_BUDGET` | No | unlimited | Max spend per run (USD) |
| `CHECKLLM_JUDGE_MODEL` | No | `gpt-4o` | Default judge model |
| `CHECKLLM_CACHE_ENABLED` | No | `true` | Cache judge responses |
| `CHECKLLM_LOG_LEVEL` | No | `WARNING` | Log verbosity |
| `CHECKLLM_PROFILE` | No | — | Activate a named profile |
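Since exactly one judge key (or a local server URL) is required, it pays to fail fast with a clear message rather than partway through a paid run. A small preflight helper you could drop into `conftest.py` — the function itself is ours, not part of checkllm:

```python
import os

# Any one of these unlocks a hosted judge backend.
JUDGE_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY")

def preflight() -> dict:
    """Validate judge configuration from the environment before tests run."""
    keys_present = [k for k in JUDGE_KEYS if os.environ.get(k)]
    base_url = os.environ.get("OLLAMA_BASE_URL")
    if not keys_present and not base_url:
        raise RuntimeError(
            "No judge configured: set one of "
            + ", ".join(JUDGE_KEYS)
            + " or OLLAMA_BASE_URL"
        )
    budget = os.environ.get("CHECKLLM_BUDGET")
    return {
        "keys": keys_present,
        "budget": float(budget) if budget else None,  # None means unlimited
    }
```

Calling this from a session-scoped fixture surfaces misconfiguration in one line of output instead of a wall of per-test API errors.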
Docker Compose: Local Dev with Ollama¶
Run checkllm alongside a local Ollama instance at zero API cost.
docker-compose.yml¶
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 10s
timeout: 5s
retries: 5
checkllm:
build: .
depends_on:
ollama:
condition: service_healthy
environment:
CHECKLLM_JUDGE_MODEL: "ollama/llama3.2"
OLLAMA_BASE_URL: "http://ollama:11434"
command: >
sh -c "
curl -s http://ollama:11434/api/pull -d '{"model":"llama3.2"}' &&
pytest tests/ -v
"
volumes:
ollama_data:
Run¶
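Assuming the compose file above, a single command builds the test image, waits for Ollama's healthcheck, and runs the suite; `--exit-code-from` propagates the pytest exit status, which is what CI needs:

```shell
docker compose up --build --exit-code-from checkllm
```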
Multi-GPU Judge Scaling with vLLM¶
For high-throughput pipelines, run a vLLM server as a local OpenAI-compatible judge backend across multiple GPUs.
Start the vLLM server¶
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
Set --tensor-parallel-size to the number of available GPUs; an inline comment after a line-continuation backslash would break the command.
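Before pointing checkllm at the server, a quick smoke test against the OpenAI-compatible endpoint confirms the model finished loading (a 70B model can take several minutes):

```shell
curl -s http://localhost:8000/v1/models
```

The response should list `meta-llama/Llama-3.1-70B-Instruct`; an empty or refused connection means the server is still initializing.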
Configure checkllm for vLLM¶
# pyproject.toml
[tool.checkllm]
judge_model = "meta-llama/Llama-3.1-70B-Instruct"
[tool.checkllm.judge_config]
base_url = "http://vllm-server:8000/v1"
api_key = "not-required"
max_concurrency = 50
Or in code:
from checkllm import OpenAICompatibleJudge
judge = OpenAICompatibleJudge(
model="meta-llama/Llama-3.1-70B-Instruct",
base_url="http://vllm-server:8000/v1",
api_key="not-required",
)
Concurrency tuning¶
| GPU memory | Model size | Recommended max_concurrency |
|---|---|---|
| 24 GB (RTX 4090) | 7 B | 32 |
| 2x40 GB (A100) | 13 B | 48 |
| 4x80 GB (H100) | 70 B | 64 |
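These values are tuning starting points, not hard limits. As a rough rule of thumb, a hypothetical helper encoding the table above by total GPU memory:

```python
def suggest_max_concurrency(total_gpu_gb: float) -> int:
    """Return a starting max_concurrency for a given total GPU memory.

    Thresholds mirror the table above; treat the result as a baseline
    to tune, not a guarantee.
    """
    if total_gpu_gb >= 320:   # e.g. 4x80 GB H100
        return 64
    if total_gpu_gb >= 80:    # e.g. 2x40 GB A100
        return 48
    if total_gpu_gb >= 24:    # e.g. single RTX 4090
        return 32
    return 8                  # conservative floor below the tabulated range
```

Watch queue depth and time-to-first-token on the vLLM server while raising the value; back off when latency climbs faster than throughput.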
Production Hardening Checklist¶
- [ ] Set `CHECKLLM_BUDGET` to prevent runaway costs
- [ ] Enable caching (`CHECKLLM_CACHE_ENABLED=true`) to deduplicate evaluations
- [ ] Pin the judge model version: `judge_model = "gpt-4o-2024-11-20"` (not `gpt-4o`)
- [ ] Tune `max_concurrency` to stay within API rate limits
- [ ] Store API keys in a secrets manager (AWS Secrets Manager, HashiCorp Vault)
- [ ] Add `--fail-on-regression` in CI to block score drops
- [ ] Run `checkllm estimate tests/` before any new CI job to sanity-check spend
- [ ] Use `requirements.lock` for reproducible builds (see Lockfile docs)
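The budget cap, pinned judge, cost estimate, and regression gate from this checklist combine naturally in CI. A GitHub Actions sketch — step layout and the exact command the `--fail-on-regression` flag attaches to are assumptions; only the env vars and the `checkllm estimate` invocation come from this guide:

```yaml
# .github/workflows/llm-tests.yml (sketch)
name: llm-tests
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # sourced from a secrets store
      CHECKLLM_BUDGET: "10.0"                        # hard spend cap per run
      CHECKLLM_JUDGE_MODEL: "gpt-4o-2024-11-20"      # pinned judge version
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.lock "checkllm[all]"
      - run: checkllm estimate tests/                # sanity-check spend first
      - run: pytest tests/ -v --fail-on-regression   # block score drops
```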