Smart vs fast vs cheap · 21 tasks
Model comparison
The same test given to three sizes of Claude at once, so you can see exactly what you give up — and save — when you pick a smaller, faster, cheaper model.
Why it matters
Bigger is not always worth it. This is where you learn to read accuracy against speed and price — the trade-off every real AI product has to make.
Sample data. These scores are from a zero-cost practice run with stand-in models — real, live numbers will replace them. The way the page works doesn't change.
The short version
Opus 4.8 is the most accurate at 90%. But Haiku 4.5 gets 81% at roughly one-fifteenth of the cost ($0.0051 vs. $0.0760), which often makes it the smarter pick.
3 models · 21 graded answers · last run 2026-06-27.
New to this? Plain-words glossary
- Benchmark: A fixed set of questions you give to different AI models so you can compare them fairly. It is the same test for everyone.
- Accuracy: The share of questions the model got right. 90% means it answered 9 out of 10 correctly.
- Latency: How long the model took to answer, in milliseconds. Lower is faster.
- Tokens & cost: Models read and write in “tokens” (small chunks of text). You pay per token, so more text means more money. That’s the cost column.
- GSM8K: A famous set of grade-school math word problems used to test step-by-step reasoning.
- MMLU: A broad multiple-choice exam spanning dozens of subjects, from history to physics. It is a standard knowledge test.
- LLM-as-judge: When there’s no single right answer, a second, strong AI grades the first one’s response against a rubric.
- Embedding: A way of turning text into a list of numbers so a computer can measure how similar two pieces of writing are. It is the engine behind semantic search.
- Recall@k: Of the documents that were actually relevant, the share that showed up in the top k search results. Higher means the search found more of what you wanted.
- Temperature: A dial for randomness. We set it to 0 so the model answers as consistently as possible, making the test repeatable.
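To make the Recall@k entry concrete, here is a minimal sketch in Python. The function name `recall_at_k` and the page ids are made up for illustration; this is not the site's retrieval code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical search result: page ids in ranked order (illustrative only).
ranked = ["faq", "pricing", "setup", "blog", "about"]
relevant = ["setup", "about"]

print(recall_at_k(ranked, relevant, k=3))  # 0.5: only "setup" is in the top 3
print(recall_at_k(ranked, relevant, k=5))  # 1.0: both relevant pages are in the top 5
```

Raising k can only keep recall the same or increase it, which is why a Recall@k number is only meaningful alongside its k.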
For the curious
How this one works
How it runs
The same questions and scoring as the public evals, run for Opus, Sonnet and Haiku. Latency is measured per call; cost is computed from stored token counts against a committed price table.
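As a rough sketch of how a cost figure can be derived from stored token counts and a price table: the model names, the prices, and the `CallRecord` shape below are all placeholders invented for illustration, not the site's committed price table or data format.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. Placeholders, not real rates.
PRICE_TABLE = {
    "big-model": {"input": 10.0, "output": 50.0},
    "small-model": {"input": 1.0, "output": 5.0},
}


@dataclass
class CallRecord:
    """Token counts stored for one model call (assumed shape)."""
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(record, prices=PRICE_TABLE):
    """Cost of one call in USD: tokens times the per-million-token rate."""
    rate = prices[record.model]
    return (record.input_tokens * rate["input"]
            + record.output_tokens * rate["output"]) / 1_000_000


def run_cost(records, prices=PRICE_TABLE):
    """Total cost of a run: the sum over all stored calls."""
    return sum(call_cost(r, prices) for r in records)


records = [
    CallRecord("small-model", input_tokens=200, output_tokens=50),
    CallRecord("small-model", input_tokens=300, output_tokens=100),
]
print(f"total: ${run_cost(records):.6f}")
```

Because the inputs are stored counts and a fixed table, rerunning the calculation gives the same cost without calling any model again.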
How it's scored
Deterministic accuracy (re-derived live), plus measured latency and reproducible cost. The leaderboard ranks the tiers and surfaces correct answers per dollar.
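The ranking and the "right answers per dollar" figure can be sketched as follows. The correct-answer counts (19, 17, 17 out of 21) are inferred from the published percentages, the costs are the rounded values shown on this page, and the tie-break rule (cheaper first) is an assumption that happens to match the displayed order.

```python
TOTAL_TASKS = 21

# (model, correct answers inferred from the percentages, displayed cost in USD)
rows = [
    ("Sonnet 4.6", 17, 0.0152),
    ("Opus 4.8", 19, 0.0760),
    ("Haiku 4.5", 17, 0.0051),
]


def correct_per_dollar(correct, cost):
    """How many right answers one dollar buys."""
    return correct / cost


# Assumed ordering: most correct answers first, cheaper model first on ties.
leaderboard = sorted(rows, key=lambda r: (-r[1], r[2]))

for rank, (model, correct, cost) in enumerate(leaderboard, start=1):
    accuracy = correct / TOTAL_TASKS
    print(rank, model, f"{accuracy:.0%}", f"{correct_per_dollar(correct, cost):.1f}")
```

Using the rounded costs, the per-dollar values land close to the table's figures but not exactly on them (for example 250.0 rather than 249.9 for Opus 4.8), presumably because the published column is computed from unrounded costs.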
Is the expensive one worth it?
Each dot is a model — accuracy up the side, cost along the bottom. Up and to the left is the sweet spot.
The full numbers
| # | Model | Accuracy | Speed (latency) | Cost | Right answers / $ |
|---|---|---|---|---|---|
| 1 | Opus 4.8 (`claude-opus-4-8`) | 90% | 2068 ms | $0.0760 | 249.9 |
| 2 | Haiku 4.5 (`claude-haiku-4-5-20251001`) | 81% | 488 ms | $0.0051 | 3353.6 |
| 3 | Sonnet 4.6 (`claude-sonnet-4-6`) | 81% | 1019 ms | $0.0152 | 1117.9 |
Where the points come from
The same scores, sliced by the kind of question — a strong overall number can hide a weak spot.
| Model | Math (GSM8K) | Knowledge (MMLU) | Reasoning |
|---|---|---|---|
| Opus 4.8 | 86% | 86% | 100% |
| Sonnet 4.6 | 86% | 71% | 86% |
| Haiku 4.5 | 86% | 86% | 71% |
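A quick way to see how the slices add up to the overall score: with 21 tasks split into three categories of 7 (an assumption read off the task list), 86% corresponds to 6 of 7, 71% to 5 of 7, and 100% to 7 of 7. The counts below are inferred from those percentages, not taken from the raw results.

```python
QUESTIONS_PER_CATEGORY = 7  # assumed: 21 tasks split evenly across 3 categories

# Correct answers per category, inferred from the percentages in the table.
correct = {
    "Opus 4.8": {"math": 6, "knowledge": 6, "reasoning": 7},
    "Sonnet 4.6": {"math": 6, "knowledge": 5, "reasoning": 6},
    "Haiku 4.5": {"math": 6, "knowledge": 6, "reasoning": 5},
}


def percent(right, total):
    """Whole-number percentage, rounded the way the table displays it."""
    return round(100 * right / total)


def overall(per_category):
    """Overall accuracy across all categories combined."""
    total = QUESTIONS_PER_CATEGORY * len(per_category)
    return percent(sum(per_category.values()), total)


for model, cats in correct.items():
    by_cat = {name: percent(n, QUESTIONS_PER_CATEGORY) for name, n in cats.items()}
    print(model, by_cat, "overall:", overall(cats))
```

Sonnet 4.6 and Haiku 4.5 both come out at 81% overall, yet they miss different kinds of question, which is the weak spot a single number hides.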
See it check itself
The math, multiple-choice and retrieval scores below are recomputed from each stored model output by the same verifier that CI runs, right now, in your browser. If a committed number ever disagreed with its raw output, its badge would read "drift" and the build would fail. Judge scores are shown as judged and are not recomputed.
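The recheck idea can be sketched in a few lines of Python for the math tasks, which ask the model to end with "#### " and the number. The function names, the regular expression, and the sample stored output are illustrative assumptions, not the site's actual verifier.

```python
import re

# Matches the final-answer marker the math prompts ask for, e.g. "#### 240".
ANSWER_RE = re.compile(r"####\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(output):
    """Return the last '#### <number>' in a model output, or None if absent."""
    matches = ANSWER_RE.findall(output)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def recheck(stored_output, expected, committed_correct):
    """Recompute correctness from the raw output and compare to the committed flag."""
    got = extract_answer(stored_output)
    recomputed = got is not None and abs(got - expected) < 1e-9
    return "ok" if recomputed == committed_correct else "drift"


# Invented stored output for the baker task (12 loaves x $4 x 5 days = 240).
stored = "12 loaves at $4 is $48 per day. Over 5 days: 48 x 5 = 240.\n#### 240"

print(recheck(stored, expected=240, committed_correct=True))   # ok
print(recheck(stored, expected=240, committed_correct=False))  # drift
```

The point of the design is that the committed score is never trusted on its own: it has to agree with what the deterministic check says about the raw output.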
A baker makes 12 loaves each morning and sells them for $4 each. If he sells all of them every day for 5 days, how much money does he earn? End with "#### " and the number.
Sarah has 3 boxes with 8 pencils in each box. She gives away 7 pencils. How many pencils does she have left? End with "#### " and the number.
A train travels 60 miles per hour for 2.5 hours. How many miles does it travel? End with "#### " and the number.
Tom buys 4 shirts at $15 each and a pair of shoes for $40. He has a $20 coupon. How much does he pay in dollars? End with "#### " and the number.
A classroom has 5 rows of 6 desks. If 3 desks are broken and removed, how many desks remain? End with "#### " and the number.
A farmer has 5 hens and each hen lays 3 eggs per day. How many eggs does he collect over 7 days? End with "#### " and the number.
A book has 240 pages. Maria reads 30 pages each day. How many days does it take her to finish? End with "#### " and the number.
What is the capital city of Australia? Answer with the letter only.
What is the chemical symbol for gold? Answer with the letter only.
Which is the largest planet in our solar system? Answer with the letter only.
Who wrote the novel "Pride and Prejudice"? Answer with the letter only.
Approximately how fast does light travel in a vacuum? Answer with the letter only.
Which human organ is primarily responsible for pumping blood through the body? Answer with the letter only.
In which year did the United States declare independence? Answer with the letter only.
All blorgs are flurgs. Some flurgs are green. Which statement must be true? Answer with the letter only.
A is taller than B. C is shorter than B. Who is the tallest? Answer with the letter only.
What number comes next in the sequence 2, 4, 8, 16, ? Answer with the letter only.
If today is Wednesday, what day of the week is it three days from now? Answer with the letter only.
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with the letter only.
All roses are flowers. Some flowers fade quickly. Which statement must be true? Answer with the letter only.
What number comes next in the sequence 1, 1, 2, 3, 5, ? Answer with the letter only.
The other tests
Classic right-or-wrong tests
Standard public benchmarks
Can the model handle school-test questions (math word problems, general knowledge, simple logic) where there is exactly one right answer we can check automatically?
When there is no single right answer
This site's own LLM features
How well the AI features on this very site behave: answering with real sources, refusing the things it should, and writing decent summaries.
Finding the right page first
Embedding & retrieval quality
Before a model can answer, it has to find the right document. This checks whether the search step actually surfaces the page you meant.
Watch the lab grow
An occasional note when a new test lands, a new model joins, or the numbers go live. No schedule, no filler.