Documentation menu
Getting started
Writing evals
Tooling
Going further
At scale
Known limitations and what a production deployment would add.
Promptopus is a focused, honest core. This page is candid about its current limits and what a production-grade deployment would build on top.
Known limitations
- No response caching yet. Identical
(provider, model, prompt, params)calls re-hit the API on every run. - Judge cost isn’t folded into per-provider cost. Judge calls have their own cost; the report’s cost metrics reflect the candidate model only.
json-schemasupports a common subset (type, required, properties, items, enum, min/max) — not the full JSON Schema spec. Swap in Ajv if you need the rest.- LLM-as-judge is only as good as the judge. Rubric saturation and self-judging bias are real; treat judge scores as directional. See LLM-as-judge.
What I’d add at scale
Caching
A content-addressed cache keyed on (provider, model, prompt, params) so re-runs and CI are near-free,
with a --no-cache escape hatch. Most of a suite is unchanged between runs; only re-call what moved.
Dataset versioning
Hash the suite and its source texts so each report is tied to an exact dataset version. Then a diff between two reports is meaningful — you know the inputs were identical and only the model changed.
CI integration
promptopus run with a --fail-on threshold (minimum pass rate, or a maximum regression delta vs a
baseline report) as a required status check. Upload the report as a build artifact and comment the
summary table on the PR.
Regression tracking across model versions
Store reports over time and chart pass-rate / cost / faithfulness as vendors ship new model snapshots.
A silent quality drop when gpt-4o-mini updates should set off an alarm, not a surprise in production.
Judge robustness
Move from absolute scoring toward pairwise comparison (which output is better, A or B), use multiple judges with an agreement threshold, and keep a small human-labeled held-out set to calibrate the judge periodically.
Architecture that makes this feasible
Because every scoring strategy is a Grader and every model is a Provider, these additions are
mostly orthogonal: caching wraps the provider call, dataset versioning hashes the resolved suite, and
CI thresholds read the existing report. The core doesn’t need to change shape to support them — which
is the point of keeping the interfaces small.
Back to the Introduction.