
The ReadAloud benchmark

A 10-model, 12-grader benchmark of a real product's TL;DR feature — and what it found.

Promptopus is dogfooded against a real feature: the TL;DR summarizer of ReadAloud, which ships an open 8B model on Cloudflare Workers AI. The benchmark answers two product questions. Is the 8B the right pick, or would a bigger or frontier model do better? And are reasoning models worth a look?

So this is a real benchmark: 10 models, 6 article excerpts, a 12-grader battery, using ReadAloud’s exact production prompt (3–5 sentences, no markdown, max_tokens: 512).

Setup

Models: gpt-4o-mini (OpenAI) plus a Workers AI ladder via the Cloudflare AI Gateway: Llama 3.2-3B → 3.1-8B → 3.1-70B → 3.3-70B, Llama-4-Scout-17B, Mistral-24B, Qwen2.5-32B, Gemma-3-12B, and the QwQ-32B reasoning model.

Graders (12):

  • Simple deterministic: non-empty, max-length, regex (no markdown).
  • LLM-as-judge (gpt-4o): judge-faithfulness, judge-quality.
  • Cost + latency: latency-budget, cost-budget.
  • Custom (in promptopus.config.mjs, no fork): sentence-count, no-meta-reference, no-reasoning-leak, number-fidelity, compression.
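
As a rough illustration of how small such graders can be, here is a minimal sketch of a sentence-count check and a no-markdown regex check. The function names, the `{ pass, reason }` return shape, and the regex are assumptions made for this example; they are not Promptopus's actual grader interface, which is described under Extending.

```javascript
// Hypothetical grader sketches. The { pass, reason } shape is an assumption,
// not Promptopus's documented grader interface.

// Split on whitespace that follows terminal punctuation, so "3.5" stays intact.
function countSentences(text) {
  return text
    .trim()
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.length > 0).length;
}

// Pass when the summary has between min and max sentences (3 to 5 by default,
// matching the production prompt's requirement).
function sentenceCountGrader(output, min = 3, max = 5) {
  const n = countSentences(output);
  return { pass: n >= min && n <= max, reason: `${n} sentences (want ${min}-${max})` };
}

// Flags common markdown: headings, list markers at line start, bold,
// backticks, and inline links. Deliberately simple, so it can over-flag.
const MARKDOWN_PATTERN =
  /(^|\n)\s*(#{1,6}\s|[-*+]\s|\d+\.\s)|\*\*|`|\[[^\]]+\]\([^)]+\)/;

function noMarkdownGrader(output) {
  const hit = MARKDOWN_PATTERN.test(output);
  return { pass: !hit, reason: hit ? "markdown detected" : "plain text" };
}
```

The sentence splitter is naive: abbreviations such as "U.S. policy" count as a sentence boundary, so a production grader would likely need a tolerance or a better tokenizer.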

Results

Model           Pass   Judge   Cost (6)   p95 latency
llama-3.1-8b    99%    1.00    $0.0003     3220 ms
mistral-24b     99%    0.98    $0.0008     4576 ms
gemma-3-12b     97%    1.00    $0.0008     1665 ms
llama-4-scout   97%    0.98    $0.0009     1944 ms
qwen2.5-32b     96%    0.98    $0.0015     3682 ms
llama-3.1-70b   96%    1.00    $0.0018     4738 ms
llama-3.2-3b    94%    0.91    $0.0002     1370 ms
gpt-4o-mini     92%    1.00    $0.0006     3859 ms
llama-3.3-70b   92%    0.93    $0.0021     4185 ms
qwq-32b         39%    0.77    $0.0033    15837 ms
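
The page does not say how the p95 column is computed. As a sketch of one common definition, the nearest-rank percentile can be written as below; the function name and the sample latencies are made up for illustration.

```javascript
// Nearest-rank percentile: sort ascending, take the value at rank ceil(p/100 * n).
// This is one common definition; Promptopus may use a different one.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative latencies in milliseconds (not benchmark data).
const latenciesMs = [810, 920, 760, 3220, 880, 990];
const p95 = percentile(latenciesMs, 95);
```

Under this definition, a sample of only six runs makes p95 equal to the slowest run, so a single slow request can dominate the column.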

Verdict

The 8B open model is the right call — and bigger is not better.

  • llama-3.1-8b is the sweet spot: 99% pass, perfect faithfulness and quality, ~$0.0003, sub-second median latency. It ties or beats the frontier gpt-4o-mini and every larger model.
  • Scaling up hurt: the largest model, llama-3.3-70b, scored 92% (same as gpt-4o-mini) at ~7× the cost. More parameters mostly bought verbosity.
  • Reasoning models are the wrong tool: qwq-32b cratered at 39%. Its output is pure chain-of-thought (“Okay, the user wants a summary… Let me read through…”), caught at once by length, sentence-count, no-reasoning-leak, number-fidelity, judged quality, and a 15.8s p95.
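
The "~7×" figure can be checked against the cost column of the Results table. The snippet below copies three of those values; they are rounded to four decimals in the table, so the ratios are approximate, and the variable names are illustrative.

```javascript
// Cost values copied from the Results table (rounded there to 4 decimals).
const cost = {
  "llama-3.1-8b": 0.0003,
  "gpt-4o-mini": 0.0006,
  "llama-3.3-70b": 0.0021,
};

// How many times more expensive model a is than model b.
const costRatio = (a, b) => cost[a] / cost[b];

console.log(costRatio("llama-3.3-70b", "llama-3.1-8b").toFixed(1)); // "7.0"
console.log(costRatio("gpt-4o-mini", "llama-3.1-8b").toFixed(1)); // "2.0"
```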

What the custom graders added

The built-ins said “most models are fine”; the custom graders found the story. compression revealed that the “smartest” models (gpt-4o-mini, llama-3.3-70b) are the least concise; no-reasoning-leak isolated QwQ’s failure; number-fidelity confirmed every production model grounded all numbers on the Apollo-11 stress case. A few task-specific graders, written in your own code, turn “looks fine” into a ranked decision. See Extending.
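
To make the three graders named above concrete, here is a minimal sketch of what number-fidelity, compression, and no-reasoning-leak checks might look like. Everything in it is an assumption for illustration: the function names, the return shapes, the phrase list, and the sample sentences are not taken from the actual promptopus.config.mjs.

```javascript
// Hypothetical sketches; names, return shapes, and the phrase list are assumptions.

// Pull numeric tokens out of text, dropping thousands separators ("1,969" -> "1969").
function extractNumbers(text) {
  return (text.match(/\d[\d,]*(?:\.\d+)?/g) || []).map((n) => n.replace(/,/g, ""));
}

// number-fidelity: every number in the summary must also appear in the source.
function numberFidelity(source, summary) {
  const known = new Set(extractNumbers(source));
  const invented = extractNumbers(summary).filter((n) => !known.has(n));
  return { pass: invented.length === 0, invented };
}

// compression: summary length as a fraction of source length (lower is more concise).
function compressionRatio(source, summary) {
  return summary.length / source.length;
}

// no-reasoning-leak: flag chain-of-thought openers like those quoted from QwQ.
const LEAK_PATTERN = /\b(okay, the user|let me (read|think|start)|first, i need to)\b/i;

function noReasoningLeak(summary) {
  return { pass: !LEAK_PATTERN.test(summary) };
}

// Usage with a made-up source sentence:
const source = "Apollo 11 landed on the Moon on July 20, 1969.";
const grounded = numberFidelity(source, "Apollo 11 landed in 1969.");
const hallucinated = numberFidelity(source, "Apollo 11 landed in 1968.");
```

Exact string matching on numbers is strict by design: a summary that writes "two" for "2" or rounds a figure would need extra normalization to pass.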

Caveats

  • Judge ceiling / self-judging: gpt-4o judges, and shares a family with gpt-4o-mini; the rubric was tightened to avoid saturation. Treat judge scores as directional.
  • Workers AI pricing is approximate: the ordering by size is the signal, not exact cents.
  • Grok needs an xAI key stored in your CF gateway (BYOK) to route through the compat endpoint.

Reproduce

cp .env.example .env   # OPENAI_API_KEY + Cloudflare Workers AI vars
npm install && npm run build
promptopus run suites/readaloud-benchmark.yaml \
  --out results/benchmark.json --max-concurrency 6
promptopus view results/benchmark.json

The custom graders load automatically from promptopus.config.mjs.

Next: What I’d do at scale.