The ReadAloud benchmark
A 10-model, 12-grader benchmark of a real product's TL;DR feature — and what it found.
Promptopus is dogfooded against a real feature: the TL;DR summarizer of ReadAloud, which ships an open 8B model on Cloudflare Workers AI. Two product questions: is the 8B the right pick — or would a bigger / frontier model be better — and are reasoning models worth a look?
So this is a real benchmark: 10 models, 6 article excerpts, a 12-grader battery, using
ReadAloud’s exact production prompt (3–5 sentences, no markdown, max_tokens: 512).
Setup
Models — gpt-4o-mini (OpenAI) plus a Workers AI ladder via the Cloudflare AI Gateway: Llama
3.2-3B → 3.1-8B → 3.1-70B → 3.3-70B, Llama-4-Scout-17B, Mistral-24B, Qwen2.5-32B, Gemma-3-12B, and the
QwQ-32B reasoning model.
Graders (12):
- Simple deterministic: non-empty, max-length, regex (no markdown).
- LLM-as-judge (gpt-4o): judge-faithfulness, judge-quality.
- Cost + latency: latency-budget, cost-budget.
- Custom (in promptopus.config.mjs, no fork): sentence-count, no-meta-reference, no-reasoning-leak, number-fidelity, compression.
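To make the deterministic checks concrete, here is a minimal sketch of what a sentence-count check and a no-markdown regex check could look like. The function names, the `{ pass, reason }` return shape, and the regex are assumptions for illustration; they are not the Promptopus grader interface or the benchmark's actual implementation. Only the 3 to 5 sentence bounds come from the production prompt described above.

```javascript
// Hypothetical grader sketches. The { pass, reason } return shape is an
// assumption for illustration, not the Promptopus grader interface.

// Naive sentence splitter: a sentence is a run of text ending in ., ! or ?
// that is followed by whitespace or the end of the text. Decimals such as
// "3.5" are not split, but abbreviations such as "U.S. " will be overcounted.
function countSentences(text) {
  const matches = text.trim().match(/[^.!?]+[.!?]+(?=\s|$)/g);
  return matches ? matches.length : 0;
}

// Passes when the output has between min and max sentences (inclusive).
function sentenceCount(output, min = 3, max = 5) {
  const n = countSentences(output);
  return {
    pass: n >= min && n <= max,
    reason: `${n} sentence(s), expected ${min}-${max}`,
  };
}

// Flags common markdown: headings, bullets, or numbered items at the start
// of a line, plus bold markers, backticks, and inline links anywhere.
const MARKDOWN_PATTERN =
  /(^|\n)\s{0,3}(#{1,6}\s|[-*+]\s|\d+\.\s)|\*\*|`|\[[^\]]+\]\([^)]+\)/;

function noMarkdown(output) {
  const hit = MARKDOWN_PATTERN.test(output);
  return { pass: !hit, reason: hit ? "markdown syntax found" : "plain text" };
}

const summary =
  "Apollo 11 launched in July 1969. It carried three astronauts. " +
  "Two of them walked on the Moon.";

console.log(sentenceCount(summary)); // pass: true (3 sentences)
console.log(noMarkdown("**Apollo 11** launched in 1969.")); // pass: false
```

Checks like these are cheap and fully deterministic, so they can run on every output before any judge call is made.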
Results
| Model | Pass | Judge | Cost (6) | p95 |
|---|---|---|---|---|
| llama-3.1-8b | 99% | 1.00 | $0.0003 | 3220 ms |
| mistral-24b | 99% | 0.98 | $0.0008 | 4576 ms |
| gemma-3-12b | 97% | 1.00 | $0.0008 | 1665 ms |
| llama-4-scout | 97% | 0.98 | $0.0009 | 1944 ms |
| qwen2.5-32b | 96% | 0.98 | $0.0015 | 3682 ms |
| llama-3.1-70b | 96% | 1.00 | $0.0018 | 4738 ms |
| llama-3.2-3b | 94% | 0.91 | $0.0002 | 1370 ms |
| gpt-4o-mini | 92% | 1.00 | $0.0006 | 3859 ms |
| llama-3.3-70b | 92% | 0.93 | $0.0021 | 4185 ms |
| qwq-32b | 39% | 0.77 | $0.0033 | 15837 ms |
Verdict
The 8B open model is the right call — and bigger is not better.
- llama-3.1-8b is the sweet spot: 99% pass, perfect faithfulness and quality, ~$0.0003, sub-second median latency. It ties or beats the frontier gpt-4o-mini and every larger model.
- Scaling up hurt: the largest model, llama-3.3-70b, scored 92% (same as gpt-4o-mini) at ~7× the cost. More parameters mostly bought verbosity.
- Reasoning models are the wrong tool: qwq-32b cratered at 39%. Its output is pure chain-of-thought (“Okay, the user wants a summary… Let me read through…”), caught at once by length, sentence-count, no-reasoning-leak, number-fidelity, judged quality, and a 15.8s p95.
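The cost multiples follow directly from the Cost (6) column of the results table. The sketch below reproduces that arithmetic; the object and function names are hypothetical, and the dollar figures are copied from the table (which the caveats describe as approximate).

```javascript
// Cost for the 6-article run, copied from the results table.
const costForSixArticles = {
  "llama-3.1-8b": 0.0003,
  "gpt-4o-mini": 0.0006,
  "llama-3.3-70b": 0.0021,
  "qwq-32b": 0.0033,
};

// How many times more expensive a model is than the baseline.
function costRatio(model, baseline = "llama-3.1-8b") {
  return costForSixArticles[model] / costForSixArticles[baseline];
}

console.log(costRatio("gpt-4o-mini").toFixed(1));   // "2.0"
console.log(costRatio("llama-3.3-70b").toFixed(1)); // "7.0"
console.log(costRatio("qwq-32b").toFixed(1));       // "11.0"
```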
What the custom graders added
The built-ins said “most models are fine”; the custom graders found the story. compression revealed
that the “smartest” models (gpt-4o-mini, llama-3.3-70b) are the least concise; no-reasoning-leak
isolated QwQ’s failure; number-fidelity confirmed every production model grounded all numbers on the
Apollo-11 stress case. A few task-specific graders, written in your own code, turn “looks fine” into a
ranked decision. See Extending.
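As a rough illustration of how little code such task-specific checks need, here is a sketch of the logic behind compression, number-fidelity, and no-reasoning-leak style checks. Everything in it is an assumption: the helper names, the word-based compression ratio, the number-matching rule, and the marker phrases (taken from the QwQ sample quoted above) are illustrative, not the benchmark's actual graders or the Promptopus API.

```javascript
// Hypothetical helpers; none of these names come from Promptopus.

function wordCount(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Compression: summary words divided by source words (lower is more concise).
function compressionRatio(source, summary) {
  const sourceWords = wordCount(source);
  return sourceWords === 0 ? 0 : wordCount(summary) / sourceWords;
}

// Pull out numeric tokens, dropping thousands separators ("1,000" -> "1000").
function extractNumbers(text) {
  return (text.match(/\d+(?:[.,]\d+)*/g) || []).map((n) => n.replace(/,/g, ""));
}

// Number fidelity: every number in the summary must also appear in the source.
// Returns the numbers that do not.
function ungroundedNumbers(source, summary) {
  const sourceNumbers = new Set(extractNumbers(source));
  return extractNumbers(summary).filter((n) => !sourceNumbers.has(n));
}

// Reasoning leak: phrases typical of chain-of-thought preambles.
const REASONING_MARKERS = [
  /^\s*okay,/i,
  /\bthe user wants\b/i,
  /\blet me (read|think|see)\b/i,
  /<think>/i,
];

function hasReasoningLeak(output) {
  return REASONING_MARKERS.some((re) => re.test(output));
}

const source = "Apollo 11 landed on the Moon on July 20, 1969 with 3 astronauts aboard.";
console.log(ungroundedNumbers(source, "Apollo 11 landed in 1968.")); // [ '1968' ]
console.log(hasReasoningLeak("Okay, the user wants a summary. Let me read through.")); // true
console.log(compressionRatio("one two three four", "one two")); // 0.5
```

Exact string matching on numbers is deliberately strict: it will flag a summary that rounds or reformats a figure, which may or may not be what a given product wants.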
Caveats
- Judge ceiling / self-judging: gpt-4o judges, and shares a family with gpt-4o-mini; the rubric was tightened to avoid saturation. Treat judge scores as directional.
- Workers AI pricing is approximate: the ordering by size is the signal, not exact cents.
- Grok needs an xAI key stored in your CF gateway (BYOK) to route through the compat endpoint.
Reproduce
```shell
cp .env.example .env   # OPENAI_API_KEY + Cloudflare Workers AI vars
npm install && npm run build
promptopus run suites/readaloud-benchmark.yaml \
  --out results/benchmark.json --max-concurrency 6
promptopus view results/benchmark.json
```
The custom graders load automatically from promptopus.config.mjs.
Next: What I’d do at scale.