Promptopus
Proven on a real product eval · see the numbers

Stop guessing which model to ship.

Promptopus turns "which model, at what cost, at what quality?" into a repeatable, version-controlled experiment. Define an eval in YAML, run it across vendors, and compare models side by side — deterministic checks, an LLM judge, and cost/latency, all in one report.

$ npx promptopus init
npm version ● TypeScript, strict — no any in core ● 52 unit tests ● MIT licensed
promptopus run readaloud-summarizer.yaml
🐙 Promptopus — ReadAloud TL;DR Summarizer
  5 cases × 2 providers, concurrency 3

  [10/10] ✓ coffee-history × llama-8b

Metric                 gpt-4o-mini  llama-8b
---------------------  -----------  --------
Pass rate              100%         100%
Score · judge          0.96         0.99
Cost · total           $0.0005      $0.0002
Latency · p95          2436ms       6880ms

🐙 report written to results.json
Runs against: OpenAI · Anthropic · Cloudflare Workers AI · Any OpenAI-compatible endpoint · Local models

See it move

The whole loop in 35 seconds.

Define a suite, run it across models, and read the tradeoff — deterministic checks, an LLM judge, and cost/latency, all in one report.

Music: “Cinematic Action Percussion Trailer” by Gregor Quendel (CC-BY 4.0)

The problem

"It worked when I tried it" is not an evaluation.

  • You eyeball a few outputs, pick a model, and hope. No numbers, no record.
  • A new model version ships and you have no way to know if quality silently regressed.
  • Cost and latency are an afterthought — discovered in the bill, not the decision.

The fix

One YAML file. One command. One report you can trust.

  • Codify your test cases, models, and grading once — re-run it anytime, in CI.
  • Score with deterministic checks, an LLM judge, and budgets — together.
  • Get a machine-readable report and a dashboard that makes the tradeoff obvious.

Three grader families, one interface.

Every scoring strategy implements the same Grader interface. Mix and match per case: structure gates, quality judges, and economics, in one pass.

free · instant

Deterministic asserts

equals, contains, regex, is-valid-json, json-schema, max-length, non-empty. Catch contract regressions without spending a token.
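As a rough illustration of how deterministic checks can share one interface, here is a minimal TypeScript sketch. The Grader and GradeResult shapes, the factory names, and gradeAll are assumptions made for this example, not Promptopus's actual types or API.

```typescript
// Hypothetical shapes for illustration only; not Promptopus's real types.
interface GradeResult {
  pass: boolean;
  score: number; // 1 when the check passes, 0 when it fails
  reason: string;
}

interface Grader {
  name: string;
  grade(output: string): GradeResult;
}

const result = (pass: boolean, reason: string): GradeResult => ({
  pass,
  score: pass ? 1 : 0,
  reason,
});

// Passes when the output contains the expected substring.
const contains = (needle: string): Grader => ({
  name: "contains",
  grade: (output) =>
    result(output.includes(needle), `expected output to contain "${needle}"`),
});

// Passes when the output is at most `limit` characters long.
const maxLength = (limit: number): Grader => ({
  name: "max-length",
  grade: (output) =>
    result(output.length <= limit, `length ${output.length}, limit ${limit}`),
});

// Passes when the output parses as JSON.
const isValidJson = (): Grader => ({
  name: "is-valid-json",
  grade: (output) => {
    try {
      JSON.parse(output);
      return result(true, "parsed as JSON");
    } catch {
      return result(false, "output is not valid JSON");
    }
  },
});

// Run every grader against one output; the case passes only if all pass.
function gradeAll(
  output: string,
  graders: Grader[],
): { pass: boolean; results: GradeResult[] } {
  const results = graders.map((g) => g.grade(output));
  return { pass: results.every((r) => r.pass), results };
}
```

Because each check is a pure function of the output string, none of them needs a model call, which is what keeps this family free and instant.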

faithfulness · quality

LLM-as-judge

Send the output (and source) to a judge model with a rubric. Structured, zod-validated scores; judge failures degrade gracefully.

budgets · p50/p95

Cost + latency

Tokens, computed USD, and latency are first-class. Set per-call budgets and roll up p50/p95 across the whole suite.
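One way to picture the p50/p95 roll-up and a per-call budget is the sketch below. It assumes the nearest-rank percentile definition and hypothetical names (percentile, summarizeLatency); Promptopus may compute percentiles differently.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are at or below it. This definition is an assumption here.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1]!;
}

// Hypothetical summary shape for illustration.
interface LatencySummary {
  p50: number;
  p95: number;
  overBudget: number; // number of calls slower than the per-call budget
}

// Roll up per-call latencies (in milliseconds) across a whole suite.
function summarizeLatency(latenciesMs: number[], budgetMs: number): LatencySummary {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    overBudget: latenciesMs.filter((ms) => ms > budgetMs).length,
  };
}
```

With only a handful of cases, nearest-rank p95 is simply the slowest call, so tail latency becomes more meaningful as the suite grows.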

Built like a tool you'd trust in CI.

Pluggable providers

OpenAI, Anthropic, any OpenAI-compatible endpoint (local or Cloudflare Workers AI), plus a keyless mock. One interface to add more.

Resilient runs

Rate-limit-aware retry/backoff that honors Retry-After, concurrency control, and per-cell error capture. A failed case never crashes the run.
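The retry behavior described above could look roughly like this sketch. All names (parseRetryAfterMs, retryDelayMs, withRetry, CallResult) and the default delays are assumptions for illustration, and jitter is left out to keep the example deterministic.

```typescript
// Parse a Retry-After header value: either delay-seconds or an HTTP date.
function parseRetryAfterMs(header: string | null, nowMs: number): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Capped exponential backoff; a server-provided Retry-After takes precedence.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  nowMs: number,
  baseMs = 500,
  capMs = 30_000,
): number {
  const serverDelay = parseRetryAfterMs(retryAfter, nowMs);
  if (serverDelay !== null) return serverDelay;
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Hypothetical result shape: failures are returned as data, never thrown.
interface CallResult<T> {
  ok: boolean;
  status: number;
  retryAfter: string | null;
  value?: T;
}

const isRetryable = (status: number): boolean => status === 429 || status >= 500;

// Retry a call up to maxAttempts times; the final result is returned even
// when it is a failure, so the caller can record it for that one cell.
async function withRetry<T>(
  call: () => Promise<CallResult<T>>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void>,
): Promise<CallResult<T>> {
  let last = await call();
  for (
    let attempt = 0;
    attempt < maxAttempts - 1 && !last.ok && isRetryable(last.status);
    attempt++
  ) {
    await sleep(retryDelayMs(attempt, last.retryAfter, Date.now()));
    last = await call();
  }
  return last;
}
```

Returning the failed result instead of throwing is one way to get per-cell error capture: the runner stores the failure and moves on to the next cell.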

Friendly configs

Every suite is zod-validated. Invalid configs fail with precise, path-pointed messages — never a stack trace.

A real dashboard

A static React dashboard reads the JSON report: comparison matrix with best/worst highlighting and per-case drill-down.

Machine-readable report

Every run writes one JSON artifact — pass rates, mean score per family, cost, p50/p95, and every raw cell — ready for CI and diffs.

Strict TypeScript

No any in core logic. The Provider and Grader interfaces are the architecture; add a provider or grader by implementing the corresponding interface.

From zero to a report in three steps.

1. Scaffold

Generate a working example suite you can run immediately.

promptopus init
2. Run

Execute the matrix, with live progress and a summary table.

promptopus run suite.yaml \
  --providers a,b \
  --out results.json
3. Compare

Open the dashboard and read the tradeoff at a glance.

promptopus view results.json

We dogfood it

A real benchmark, 10 models deep.

We benchmarked a real product's TL;DR feature (ReadAloud) across 10 models with a 12-grader battery — simple asserts, an LLM judge, cost/latency, and five custom graders. 60 cells, judged by gpt-4o.

Model                    Pass  Cost
-----------------------  ----  -------
llama-3.1-8b · ships     99%   $0.0003
gpt-4o-mini · frontier   92%   $0.0006
llama-3.3-70b · largest  92%   $0.0021
qwq-32b · reasoning      39%   $0.0033

The verdict: the 8B open model that ReadAloud ships wins. Bigger and frontier models cost more for no gain, and a reasoning model cratered at 39%, caught leaking chain-of-thought. Custom graders made the call.

Read the full writeup
[Image: Promptopus dashboard, 10-model benchmark comparison matrix]

Questions, answered.

Do I need API keys to try it?

No. A built-in mock provider runs a full eval with zero keys and zero cost. For real models, keys are read from your environment (auto-loaded from a .env file) and never written into the report.

Which providers are supported?

OpenAI, Anthropic, any OpenAI-compatible endpoint (local servers like Ollama/vLLM, or Cloudflare Workers AI), and a keyless mock. Adding another vendor means implementing one interface.

Can I add my own grader or provider without forking?

Yes — define them in your own project via a promptopus.config.mjs (auto-loaded by the CLI) or the library API. Built-in types stay strictly validated; your custom ones flow straight through.

Is it free and open source?

Yes — MIT licensed and published on npm: npm i promptopus. Run it with npx promptopus.

How are cost and latency measured?

Every call reports tokens and latency; cost is computed from a per-model pricing table. The report rolls up total/mean USD and p50/p95 latency per provider as first-class metrics.
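To make the pricing-table idea concrete, here is a small sketch of computing USD from token counts. The model names, prices, and type names are made up for the example and do not reflect Promptopus's actual pricing table or any vendor's real prices.

```typescript
// Hypothetical prices in USD per million tokens; real prices vary by vendor.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

const PRICING: Record<string, Price> = {
  "example-small": { inputPerMTok: 0.15, outputPerMTok: 0.6 },
  "example-large": { inputPerMTok: 2.5, outputPerMTok: 10 },
};

// Token usage reported by a single call (hypothetical shape).
interface Usage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Cost of one call: tokens times the per-million-token price for that model.
function costUsd(usage: Usage): number {
  const price = PRICING[usage.model];
  if (price === undefined) {
    throw new Error(`no pricing entry for model "${usage.model}"`);
  }
  return (
    (usage.inputTokens * price.inputPerMTok +
      usage.outputTokens * price.outputPerMTok) /
    1_000_000
  );
}

// Roll up total and mean USD across all calls made to one provider.
function rollUp(calls: Usage[]): { totalUsd: number; meanUsd: number } {
  const totalUsd = calls.reduce((sum, c) => sum + costUsd(c), 0);
  return { totalUsd, meanUsd: calls.length === 0 ? 0 : totalUsd / calls.length };
}
```

Failing loudly on a missing pricing entry is a deliberate choice in this sketch: a silent zero would make an unpriced model look free in the comparison.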

Does it fit into CI?

A run writes one JSON report you can diff and gate on, and the runner is resilient — concurrency control, rate-limit-aware retries, and per-cell error capture so one failure never crashes the run.
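Gating on the report might look like the sketch below. The Report shape here is hypothetical, loosely modeled on the metrics this page lists (pass rate, cost, p95 latency); the real results.json schema may differ, so treat the field names as assumptions.

```typescript
// Hypothetical per-provider summary; the real report schema may differ.
interface ProviderSummary {
  passRate: number; // 0..1
  costUsd: number; // total cost for the run
  latencyP95Ms: number;
}

type Report = Record<string, ProviderSummary>;

interface Thresholds {
  minPassRate: number;
  maxCostUsd: number;
  maxLatencyP95Ms: number;
}

// Return one human-readable message per violated threshold.
// An empty array means the gate passes.
function gate(report: Report, t: Thresholds): string[] {
  const failures: string[] = [];
  for (const [provider, s] of Object.entries(report)) {
    if (s.passRate < t.minPassRate) {
      failures.push(`${provider}: pass rate ${s.passRate} < ${t.minPassRate}`);
    }
    if (s.costUsd > t.maxCostUsd) {
      failures.push(`${provider}: cost $${s.costUsd} > $${t.maxCostUsd}`);
    }
    if (s.latencyP95Ms > t.maxLatencyP95Ms) {
      failures.push(`${provider}: p95 ${s.latencyP95Ms}ms > ${t.maxLatencyP95Ms}ms`);
    }
  }
  return failures;
}
```

In a CI job, a script would parse the JSON report, call gate, print any messages, and exit non-zero when the array is not empty.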

Make your next model decision evidence-based.

Install Promptopus, scaffold a suite, and have a comparison report in your terminal before your coffee's cold.