Promptopus

Introduction

What Promptopus is, the problem it solves, and how the pieces fit together.

Promptopus is a config-driven LLM evaluation harness. You describe an evaluation in a YAML file — the test cases, the models to compare, and how to grade each output — then run one command to produce a machine-readable report and a visual dashboard for comparing models side by side.

It’s the difference between “it looked fine when I tried it” and “here are the numbers.”

Why it exists

Shipping an LLM feature means continually answering one question: which model, at what cost, at what quality? And then re-answering it every time a vendor ships a new model version. Done by hand, that means eyeballing a handful of outputs and hoping. There is no record and no regression signal, and cost and latency are discovered in the bill rather than when the decision is made.

Promptopus makes that question a repeatable, version-controlled experiment:

  • One YAML file defines your cases, providers, and graders.
  • One command (promptopus run) executes the full case × provider matrix and writes a JSON report.
  • One dashboard (promptopus view) turns that report into a comparison you can reason about.

The mental model

A run is a matrix: every provider generates an output for every test case, and each output is scored by one or more graders. The results roll up into a report.

                gpt-4o-mini      llama-3.1-8b
              ┌──────────────┬──────────────┐
  case A      │  graders…    │  graders…    │
  case B      │  graders…    │  graders…    │   →   Report (JSON)
  case C      │  graders…    │  graders…    │        + dashboard
              └──────────────┴──────────────┘
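
The matrix above can be sketched in a few lines of Python. This is an illustration of the mental model only, not Promptopus's actual runner: `run_matrix`, `Cell`, and the callable-based provider and grader shapes are hypothetical names chosen for the example, and the "providers" are local functions so nothing calls a real model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for this sketch (not the Promptopus API):
# a provider is any callable prompt -> output text, and a grader is
# any callable output text -> (score, passed).
Provider = Callable[[str], str]
Grader = Callable[[str], tuple[float, bool]]


@dataclass
class Cell:
    """One cell of the case x provider matrix."""
    case: str
    provider: str
    output: str
    grades: dict[str, tuple[float, bool]]


def run_matrix(
    cases: dict[str, str],
    providers: dict[str, Provider],
    graders: dict[str, Grader],
) -> list[Cell]:
    """Generate an output for every (case, provider) pair and grade it."""
    cells = []
    for case_name, prompt in cases.items():
        for provider_name, generate in providers.items():
            output = generate(prompt)
            grades = {name: grade(output) for name, grade in graders.items()}
            cells.append(Cell(case_name, provider_name, output, grades))
    return cells


# Toy inputs: two cases, two fake providers, one grader.
cases = {"case A": "Say hello", "case B": "Say goodbye"}
providers = {
    "echo": lambda prompt: prompt,
    "upper": lambda prompt: prompt.upper(),
}
graders = {"non-empty": lambda out: (1.0, True) if out else (0.0, False)}

cells = run_matrix(cases, providers, graders)
```

Two cases and two providers yield four cells, each carrying its own grades, which is the shape the report and dashboard then summarize.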

Two interfaces carry the whole design:

  • A Provider wraps one (vendor, model) pair behind a single generate(prompt) call.
  • A Grader scores one output and returns { score, passed, detail }.

Adding a new model vendor or a new scoring strategy means implementing one interface and registering it — nothing else in the runner, report, or dashboard changes. See Core concepts for the full data model.
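
As a rough sketch of what "implement one interface and register it" could look like, here is a Python approximation. Only `generate(prompt)` and the `{ score, passed, detail }` result shape come from the description above. The `grade` method name, the plain-string return type of `generate`, the `PROVIDERS` registry, and `register_provider` are assumptions made for illustration and may not match the real code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GradeResult:
    """Mirrors the { score, passed, detail } shape described above."""
    score: float
    passed: bool
    detail: str


class Provider(Protocol):
    # Assumption: returns plain text. A real provider likely also
    # reports tokens, cost, and latency.
    def generate(self, prompt: str) -> str: ...


class Grader(Protocol):
    # Assumption: the method name "grade" is hypothetical.
    def grade(self, output: str) -> GradeResult: ...


# Hypothetical registry: a name -> provider mapping.
PROVIDERS: dict[str, Provider] = {}


def register_provider(name: str, provider: Provider) -> None:
    PROVIDERS[name] = provider


class EchoProvider:
    """Toy provider that returns the prompt unchanged (no network calls)."""

    def generate(self, prompt: str) -> str:
        return prompt


class MinLengthGrader:
    """Toy grader that passes when the output has at least min_chars characters."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def grade(self, output: str) -> GradeResult:
        ok = len(output) >= self.min_chars
        return GradeResult(
            score=1.0 if ok else 0.0,
            passed=ok,
            detail=f"length {len(output)}, minimum {self.min_chars}",
        )


register_provider("echo", EchoProvider())
result = MinLengthGrader(min_chars=3).grade(PROVIDERS["echo"].generate("hello"))
```

Because both interfaces are structural, the toy classes satisfy them without inheriting from anything, which is one way to keep the runner unaware of any specific vendor or scoring strategy.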

The three grader families

Promptopus ships three families of graders, all behind that same Grader interface:

  1. Deterministic — fast, free assertions (contains, regex, json-schema, …).
  2. LLM-as-judge — a judge model scores faithfulness and quality against a rubric.
  3. Cost + latency — budgets over the tokens, USD, and latency every call already reports.

Most real suites use all three at once: deterministic gates for structure, judges for quality, and budgets for the economics. See Graders.
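
To make the families concrete, here is a minimal Python sketch of two of them: deterministic assertions and a budget check. The function names and signatures are hypothetical, not the shipped graders, and the LLM-as-judge family is left out because it needs a real model call.

```python
import re
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Mirrors the { score, passed, detail } shape of a grader result."""
    score: float
    passed: bool
    detail: str


def contains(output: str, needle: str) -> GradeResult:
    """Deterministic: pass when the output includes the given substring."""
    found = needle in output
    status = "found" if found else "missing"
    return GradeResult(1.0 if found else 0.0, found, f"substring {needle!r} {status}")


def matches_regex(output: str, pattern: str) -> GradeResult:
    """Deterministic: pass when the pattern matches anywhere in the output."""
    found = re.search(pattern, output) is not None
    status = "matched" if found else "did not match"
    return GradeResult(1.0 if found else 0.0, found, f"pattern {pattern!r} {status}")


def within_budget(actual: float, limit: float, label: str) -> GradeResult:
    """Budget: pass when a measured value (cost, latency, tokens) is at or under its limit."""
    ok = actual <= limit
    return GradeResult(1.0 if ok else 0.0, ok, f"{label}: {actual} (limit {limit})")


output = "Order #1234 has shipped."
checks = [
    contains(output, "shipped"),
    matches_regex(output, r"#\d{4}"),
    within_budget(actual=850.0, limit=500.0, label="latency_ms"),
]
all_passed = all(check.passed for check in checks)
```

Here the output passes both structural checks but fails the latency budget, so the case fails overall. That is the kind of result a suite combining families is meant to surface.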

What you get out

Every run writes a single JSON report containing per-provider pass rates, mean score per grader family, total/mean cost, p50/p95 latency, and every raw result. That artifact is designed for CI, diffs, and the dashboard.
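
The roll-up can be illustrated with a short Python sketch. The field names (`provider`, `passed`, `cost_usd`, `latency_ms`, and the summary keys) are assumptions for this example and are not the actual report schema. The percentile uses the nearest-rank method, which may differ from what Promptopus computes, and the per-family mean score is omitted for brevity.

```python
import json
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[dict]) -> dict:
    """Group raw results by provider and compute per-provider aggregates."""
    by_provider = defaultdict(list)
    for row in results:
        by_provider[row["provider"]].append(row)

    summary = {}
    for provider, rows in by_provider.items():
        latencies = [row["latency_ms"] for row in rows]
        costs = [row["cost_usd"] for row in rows]
        summary[provider] = {
            "pass_rate": sum(row["passed"] for row in rows) / len(rows),
            "total_cost_usd": sum(costs),
            "mean_cost_usd": sum(costs) / len(costs),
            "p50_latency_ms": percentile(latencies, 50),
            "p95_latency_ms": percentile(latencies, 95),
        }
    return summary


# Made-up raw results, for illustration only.
results = [
    {"provider": "model-a", "case": "A", "passed": True, "cost_usd": 0.002, "latency_ms": 120},
    {"provider": "model-a", "case": "B", "passed": False, "cost_usd": 0.004, "latency_ms": 300},
    {"provider": "model-b", "case": "A", "passed": True, "cost_usd": 0.001, "latency_ms": 80},
    {"provider": "model-b", "case": "B", "passed": True, "cost_usd": 0.001, "latency_ms": 90},
]

# Keep the aggregates and every raw result in a single JSON document.
report = {"summary": summarize(results), "results": results}
report_json = json.dumps(report, indent=2)
```

Keeping the raw results next to the aggregates is what makes a single file usable for diffs and CI checks as well as the dashboard.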

Next

Head to Getting started to install Promptopus and run your first eval in about a minute — no API keys required for the first run.