Promptopus turns "which model, at what cost, at what quality?" into a repeatable, version-controlled experiment. Define an eval in YAML, run it across vendors, and compare models side by side — deterministic checks, an LLM judge, and cost/latency, all in one report.
🐙 Promptopus — ReadAloud TL;DR Summarizer 5 cases × 2 providers, concurrency 3 [10/10] ✓ coffee-history × llama-8b Metric gpt-4o-mini llama-8b --------------------- ----------- -------- Pass rate 100% 100% Score · judge 0.96 0.99 Cost · total $0.0005 $0.0002 Latency · p95 2436ms 6880ms 🐙 report written to results.json
The whole loop in 35 seconds.
Define a suite, run it across models, and read the tradeoff — deterministic checks, an LLM judge, and cost/latency, all in one report.
Music: “Cinematic Action Percussion Trailer” by Gregor Quendel (CC-BY 4.0)
"It worked when I tried it" is not an evaluation.
One YAML file. One command. One report you can trust.
Every scoring strategy implements the same Grader.
Mix and match per case — structure gates, quality judges, and economics, in one pass.
equals, contains, regex, is-valid-json, json-schema, max-length, non-empty. Catch contract regressions without spending a token.
Send the output (and source) to a judge model with a rubric. Structured, zod-validated scores; judge failures degrade gracefully.
Tokens, computed USD, and latency are first-class. Set per-call budgets and roll up p50/p95 across the whole suite.
OpenAI, Anthropic, any OpenAI-compatible endpoint (local or Cloudflare Workers AI), plus a keyless mock. One interface to add more.
Rate-limit-aware retry/backoff that honors Retry-After, concurrency control, and per-cell error capture. A failed case never crashes the run.
Every suite is zod-validated. Invalid configs fail with precise, path-pointed messages — never a stack trace.
A static React dashboard reads the JSON report: comparison matrix with best/worst highlighting and per-case drill-down.
Every run writes one JSON artifact — pass rates, mean score per family, cost, p50/p95, and every raw cell — ready for CI and diffs.
No any in core logic. The Provider and Grader interfaces are the architecture; add one by implementing one interface.
Generate a working example suite you can run immediately.
promptopus init
Execute the matrix, with live progress and a summary table.
promptopus run suite.yaml \ --providers a,b \ --out results.json
Open the dashboard and read the tradeoff at a glance.
promptopus view results.json
A real benchmark, 10 models deep.
We benchmarked a real product's TL;DR feature (ReadAloud) across 10 models with a 12-grader battery — simple asserts, an LLM judge, cost/latency, and five custom graders. 60 cells, judged by gpt-4o.
| Model | Pass | Cost |
|---|---|---|
| llama-3.1-8b · ships | 99% | $0.0003 |
| gpt-4o-mini · frontier | 92% | $0.0006 |
| llama-3.3-70b · largest | 92% | $0.0021 |
| qwq-32b · reasoning | 39% | $0.0033 |
The verdict: the 8B open model ReadAloud ships wins — bigger and frontier models cost more for no gain, and a reasoning model cratered at 39%, caught leaking chain-of-thought. Custom graders made the call.
Read the full writeup
No. A built-in mock provider runs a full eval with zero keys and zero cost. For real models, keys are read from your environment (auto-loaded from a .env) and never written into the report.
OpenAI, Anthropic, any OpenAI-compatible endpoint (local servers like Ollama/vLLM, or Cloudflare Workers AI), and a keyless mock. Adding another vendor is one interface.
Yes — define them in your own project via a promptopus.config.mjs (auto-loaded by the CLI) or the library API. Built-in types stay strictly validated; your custom ones flow straight through.
Yes — MIT licensed and published on npm: npm i promptopus. Run it with npx promptopus.
Every call reports tokens and latency; cost is computed from a per-model pricing table. The report rolls up total/mean USD and p50/p95 latency per provider as first-class metrics.
A run writes one JSON report you can diff and gate on, and the runner is resilient — concurrency control, rate-limit-aware retries, and per-cell error capture so one failure never crashes the run.
Install Promptopus, scaffold a suite, and have a comparison report in your terminal before your coffee's cold.