Introduction
What Promptopus is, the problem it solves, and how the pieces fit together.
Promptopus is a config-driven LLM evaluation harness. You describe an evaluation in a YAML file — the test cases, the models to compare, and how to grade each output — then run one command to produce a machine-readable report and a visual dashboard for comparing models side by side.
It’s the difference between “it looked fine when I tried it” and “here are the numbers.”
Why it exists
Shipping an LLM feature means continually answering one question: which model, at what cost, at what quality? And then re-answering it every time a vendor ships a new model version. Done by hand, that means eyeballing a handful of outputs and hoping: no record, no regression signal, and cost and latency discovered on the bill rather than weighed in the decision.
Promptopus makes that question a repeatable, version-controlled experiment:
- One YAML file defines your cases, providers, and graders.
- One command (`promptopus run`) executes the full `case × provider` matrix and writes a JSON report.
- One dashboard (`promptopus view`) turns that report into a comparison you can reason about.
The mental model
A run is a matrix: every provider generates an output for every test case, and each output is scored by one or more graders. The results roll up into a report.
          gpt-4o-mini    llama-3.1-8b
        ┌──────────────┬──────────────┐
case A  │ graders…     │ graders…     │
case B  │ graders…     │ graders…     │  → Report (JSON)
case C  │ graders…     │ graders…     │    + dashboard
        └──────────────┴──────────────┘
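The matrix can be sketched in a few lines of Python. This is an illustration of the mental model only, not Promptopus's runner: the toy providers, the `run_matrix` function, and the result field names are all assumptions made for the example.

```python
from itertools import product

# Hypothetical stand-ins: each "provider" is a function from prompt to output,
# and each "grader" is a function from output to a score between 0 and 1.
providers = {
    "model-a": lambda prompt: prompt.upper(),
    "model-b": lambda prompt: prompt[::-1],
}
cases = {"case A": "hello", "case B": "world"}
graders = {"non-empty": lambda output: 1.0 if output else 0.0}


def run_matrix(cases, providers, graders):
    """Run every case through every provider, then score each output with every grader."""
    results = []
    for (case_id, prompt), (provider_id, generate) in product(cases.items(), providers.items()):
        output = generate(prompt)
        scores = {name: grade(output) for name, grade in graders.items()}
        results.append(
            {"case": case_id, "provider": provider_id, "output": output, "scores": scores}
        )
    return results


results = run_matrix(cases, providers, graders)
print(len(results))  # one result per (case, provider) cell
```

Two cases and two providers give four cells; adding a provider adds a column, and adding a case adds a row.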
Two interfaces carry the whole design:
- A `Provider` wraps one `(vendor, model)` pair behind a single `generate(prompt)` call.
- A `Grader` scores one output and returns `{ score, passed, detail }`.
Adding a new model vendor or a new scoring strategy means implementing one interface and registering it — nothing else in the runner, report, or dashboard changes. See Core concepts for the full data model.
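As a rough sketch of what "implement one interface and register it" might look like, here are the two interfaces written as Python protocols. Only `generate(prompt)` and the `score`, `passed`, `detail` fields come from the description above. The `grade` method name, the string return type of `generate`, the `PROVIDERS` registry, and the `register_provider` decorator are assumptions for illustration, and Promptopus's real interfaces may differ in language and naming.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GradeResult:
    score: float  # assumed here to lie between 0.0 and 1.0
    passed: bool
    detail: str


class Provider(Protocol):
    def generate(self, prompt: str) -> str: ...


class Grader(Protocol):
    # Hypothetical method name; the text only specifies the returned fields.
    def grade(self, output: str) -> GradeResult: ...


# Hypothetical registry mapping a config name to a provider class.
PROVIDERS: dict[str, type] = {}


def register_provider(name: str):
    def decorator(cls):
        PROVIDERS[name] = cls
        return cls
    return decorator


@register_provider("echo")
class EchoProvider:
    """Toy provider that needs no API key: it returns the prompt unchanged."""

    def generate(self, prompt: str) -> str:
        return prompt


class ContainsGrader:
    """Toy grader that passes when a substring appears in the output."""

    def __init__(self, needle: str) -> None:
        self.needle = needle

    def grade(self, output: str) -> GradeResult:
        ok = self.needle in output
        return GradeResult(score=1.0 if ok else 0.0, passed=ok, detail=f"looked for {self.needle!r}")


provider: Provider = PROVIDERS["echo"]()
grader: Grader = ContainsGrader("octopus")
result = grader.grade(provider.generate("an octopus has eight arms"))
print(result.passed)
```

The calling code at the bottom depends only on the two protocols, which is why a new vendor or scoring strategy would not need changes anywhere else.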
The three grader families
Promptopus ships three families of graders, all behind that same Grader interface:
- Deterministic: fast, free assertions (`contains`, `regex`, `json-schema`, …).
- LLM-as-judge: a judge model scores faithfulness and quality against a rubric.
- Cost + latency: budgets over the tokens, USD, and latency every call already reports.
Most real suites use all three at once: deterministic gates for structure, judges for quality, and budgets for the economics. See Graders.
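To make the combination concrete, the sketch below applies one grader from each family to a single call. Everything here is hypothetical: the `Call` record, its field names, the three factory functions, and the thresholds are invented for the example, and the judge is a stub callable rather than a real model.

```python
import re
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record of one provider call; field names are illustrative."""
    output: str
    cost_usd: float
    latency_ms: float


def regex_grader(pattern: str):
    """Deterministic: passes when the pattern matches anywhere in the output."""
    compiled = re.compile(pattern)

    def grade(call: Call) -> dict:
        ok = compiled.search(call.output) is not None
        return {"score": 1.0 if ok else 0.0, "passed": ok, "detail": f"pattern {pattern!r}"}

    return grade


def budget_grader(max_usd: float, max_latency_ms: float):
    """Cost + latency: passes when the call stays within both budgets."""

    def grade(call: Call) -> dict:
        ok = call.cost_usd <= max_usd and call.latency_ms <= max_latency_ms
        detail = f"${call.cost_usd:.4f}, {call.latency_ms:.0f} ms"
        return {"score": 1.0 if ok else 0.0, "passed": ok, "detail": detail}

    return grade


def judge_grader(judge, threshold: float):
    """LLM-as-judge: `judge` is any callable returning a rubric score from 0 to 1.
    A real suite would call a judge model here; this example passes in a stub."""

    def grade(call: Call) -> dict:
        score = judge(call.output)
        return {"score": score, "passed": score >= threshold, "detail": "stub judge"}

    return grade


call = Call(output='{"answer": 42}', cost_usd=0.0003, latency_ms=850.0)
graders = [
    regex_grader(r'"answer"\s*:\s*\d+'),                  # structure gate
    budget_grader(max_usd=0.001, max_latency_ms=1000.0),  # economics
    judge_grader(judge=lambda text: 0.8, threshold=0.7),  # quality (stubbed)
]
verdicts = [grade(call) for grade in graders]
print(all(v["passed"] for v in verdicts))
```

All three return the same result shape, so the code that consumes the verdicts does not need to know which family produced each one.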
What you get out
Every run writes a single JSON report containing per-provider pass rates, mean score per grader family, total/mean cost, p50/p95 latency, and every raw result. That artifact is designed for CI, diffs, and the dashboard.
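The roll-up from raw results to per-provider numbers might look roughly like the following. The field names, the `summarize` function, and the sample data are made up for illustration and do not reflect the actual report schema. The percentile uses the nearest-rank method, which is one of several common definitions and may not be the one Promptopus uses. Mean score per grader family is omitted for brevity.

```python
import json
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results):
    """Roll raw per-call results up into one summary per provider."""
    by_provider = {}
    for r in results:
        by_provider.setdefault(r["provider"], []).append(r)

    summary = {}
    for provider, rows in by_provider.items():
        latencies = [r["latency_ms"] for r in rows]
        summary[provider] = {
            "pass_rate": sum(r["passed"] for r in rows) / len(rows),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
            "mean_cost_usd": mean(r["cost_usd"] for r in rows),
            "p50_latency_ms": percentile(latencies, 50),
            "p95_latency_ms": percentile(latencies, 95),
        }
    return summary


# Made-up raw results, for illustration only.
raw = [
    {"provider": "model-a", "passed": True, "cost_usd": 0.002, "latency_ms": 400},
    {"provider": "model-a", "passed": False, "cost_usd": 0.004, "latency_ms": 900},
    {"provider": "model-b", "passed": True, "cost_usd": 0.001, "latency_ms": 250},
    {"provider": "model-b", "passed": True, "cost_usd": 0.001, "latency_ms": 300},
]
report = {"summary": summarize(raw), "results": raw}
print(json.dumps(report["summary"], indent=2))
```

Keeping the raw results alongside the summary is what makes a report diffable: two runs can be compared cell by cell, not just by their averages.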
Next
Head to Getting started to install Promptopus and run your first eval in about a minute — no API keys required for the first run.