
Smart vs fast vs cheap · 21 tasks

Model comparison

The same test given to three sizes of Claude at once, so you can see exactly what you give up — and save — when you pick a smaller, faster, cheaper model.

Why it matters

Bigger is not always worth it. This is where you learn to read accuracy against speed and price — the trade-off every real AI product has to make.

Sample data. These scores are from a zero-cost practice run with stand-in models — real, live numbers will replace them. The way the page works doesn't change.

Opus 4.8 · 2068ms · $0.076 · 90%
Sonnet 4.6 · 1019ms · $0.015 · 81%
Haiku 4.5 · 488ms · $0.00507 · 81%

The short version

Opus 4.8 is the most accurate at 90%. But Haiku 4.5 reaches 81% at about one-fifteenth of the cost ($0.00507 vs $0.076), which often makes it the smarter pick.

3 models · 21 graded answers · last run 2026-06-27.

New to this? Plain-words glossary
Benchmark
A fixed set of questions you give to different AI models so you can compare them fairly — the same test for everyone.
Accuracy
The share of questions the model got right. 90% means it answered 9 out of 10 correctly.
Latency
How long the model took to answer, in milliseconds. Lower is faster.
Tokens & cost
Models read and write in “tokens” (chunks of words). You pay per token, so more text means more money — that’s the cost column.
GSM8K
A famous set of grade-school math word problems used to test step-by-step reasoning.
MMLU
A broad multiple-choice exam spanning dozens of subjects, from history to physics — a standard knowledge test.
LLM-as-judge
When there’s no single right answer, a second, strong AI grades the first one’s response against a rubric.
Embedding
A way of turning text into a list of numbers so a computer can measure how similar two pieces of writing are — the engine behind semantic search.
Recall@k
Out of the documents that were actually relevant, how many showed up in the top k search results. Higher means the search found what you wanted.
Temperature
A dial for randomness. We set it to 0 so the model answers as consistently as possible, making the test repeatable.
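
The Recall@k entry above can be written as a few lines of Python. This is a minimal sketch, not the site's scoring code; the function name and the sample page IDs are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    top_k = set(ranked_ids[:k])
    return len(relevant & top_k) / len(relevant)


# Hypothetical search result: 2 of the 3 relevant pages are in the top 3.
ranked = ["pricing", "faq", "install", "changelog", "about"]
relevant = ["pricing", "install", "about"]

print(recall_at_k(ranked, relevant, k=3))
print(recall_at_k(ranked, relevant, k=5))
```

With k=3 the score is two thirds; widening to k=5 picks up the last relevant page and the score reaches 1.0.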

For the curious

How this one works

How it runs

The same questions and scoring as the public evals, run for Opus, Sonnet and Haiku. Latency is measured per call; cost is computed from stored token counts against a committed price table.
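
As a rough illustration of "stored token counts against a committed price table", the sketch below multiplies token counts by per-token prices. The tier names, the prices and the token counts are all placeholders invented for the example, not the site's real table, and the per-million-token unit is an assumption.

```python
# Hypothetical price table in USD per million tokens (placeholder values only).
PRICES_PER_MTOK = {
    "large": {"input": 10.00, "output": 50.00},
    "small": {"input": 1.00, "output": 5.00},
}


def call_cost(model, input_tokens, output_tokens, prices=PRICES_PER_MTOK):
    """Cost of one call: tokens times price, scaled from per-million to per-token."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def run_cost(model, calls):
    """Total cost of a run, given stored (input_tokens, output_tokens) pairs."""
    return sum(call_cost(model, i, o) for i, o in calls)


# Hypothetical stored token counts for three calls.
calls = [(120, 80), (95, 60), (140, 100)]

for model in PRICES_PER_MTOK:
    print(model, f"${run_cost(model, calls):.6f}")
```

Because the inputs are stored counts and a fixed table, rerunning the calculation gives the same cost every time, which is what makes the cost column reproducible.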

How it's scored

Deterministic accuracy (re-derived live), plus measured latency and reproducible cost. The leaderboard ranks the tiers and surfaces correct answers per dollar.

Is the expensive one worth it?

Each dot is a model — accuracy up the side, cost along the bottom. Up and to the left is the sweet spot.

[Scatter plot: accuracy (vertical axis, 73% to 98%) against total token cost (horizontal axis). Points: Opus 4.8, 90% · $0.076; Sonnet 4.6, 81% · $0.015; Haiku 4.5, 81% · $0.00507.]
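
One way to make "up and to the left" precise is a dominance check: a model is worth considering only if no other model is at least as accurate and at least as cheap. The sketch below applies that idea to the three points on the chart. It is an illustration, not something the page itself computes, and the function names are made up.

```python
# (name, accuracy, total token cost in USD) as plotted on the chart.
models = [
    ("Opus 4.8", 0.90, 0.076),
    ("Sonnet 4.6", 0.81, 0.015),
    ("Haiku 4.5", 0.81, 0.00507),
]


def dominated(a, b):
    """True if b is no worse than a on both axes and strictly better on one."""
    _, acc_a, cost_a = a
    _, acc_b, cost_b = b
    no_worse = acc_b >= acc_a and cost_b <= cost_a
    strictly_better = acc_b > acc_a or cost_b < cost_a
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominated(p, q) for q in points)]


front = [name for name, _, _ in pareto_front(models)]
print(front)
```

On this sample data Sonnet 4.6 drops out, because Haiku 4.5 matches its accuracy at a lower cost. Opus 4.8 and Haiku 4.5 remain as the two real choices: pay more for accuracy, or pay less and accept 81%.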

The full numbers

# | Model | Accuracy | Speed | Cost | Right answers / $
1 | Opus 4.8 (claude-opus-4-8) | 90% | 2068ms | $0.0760 | 249.9
2 | Haiku 4.5 (claude-haiku-4-5-20251001) | 81% | 488ms | $0.0051 | 3353.6
3 | Sonnet 4.6 (claude-sonnet-4-6) | 81% | 1019ms | $0.0152 | 1117.9
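
The last column can be approximated from the others. The sketch below divides each model's correct-answer count (taken from the task-by-task results on this page) by the cost as displayed. The ranking rule, accuracy first and right answers per dollar as the tie-break, is an assumption that happens to reproduce the order shown.

```python
# (correct answers out of 21, total cost in USD as displayed in the table)
RESULTS = {
    "Opus 4.8": (19, 0.0760),
    "Haiku 4.5": (17, 0.0051),
    "Sonnet 4.6": (17, 0.0152),
}


def right_answers_per_dollar(correct, cost_usd):
    """Correct answers bought per dollar of token spend."""
    return correct / cost_usd


per_dollar = {
    model: right_answers_per_dollar(correct, usd)
    for model, (correct, usd) in RESULTS.items()
}

# Assumed ranking rule: higher accuracy first, then more right answers per dollar.
ranking = sorted(RESULTS, key=lambda m: (-RESULTS[m][0], -per_dollar[m]))

for model in ranking:
    print(f"{model}: {per_dollar[model]:.1f} right answers per dollar")
```

The results land close to the table (roughly 250, 3333 and 1118) but not exactly on it, presumably because the page divides by unrounded costs while this sketch only has the rounded ones.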

Where the points come from

The same scores, sliced by the kind of question — a strong overall number can hide a weak spot.

Model | Math (GSM8K) | Knowledge (MMLU) | Reasoning
Opus 4.8 | 86% | 86% | 100%
Sonnet 4.6 | 86% | 71% | 86%
Haiku 4.5 | 86% | 86% | 71%
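
The breakdown can be reproduced by hand from the task-by-task results on this page: each category has 7 tasks, and the misses below are the tasks where a model's answer differs from the gold answer. A minimal sketch, with variable and function names chosen for illustration:

```python
# Seven tasks per category, 21 in total.
CATEGORIES = {"gsm8k": 7, "mmlu": 7, "reasoning": 7}
PREFIX_TO_CATEGORY = {"gsm8k": "gsm8k", "mmlu": "mmlu", "reason": "reasoning"}

# Task IDs where the model's answer does not match the gold answer.
MISSES = {
    "Opus 4.8": ["gsm8k-01", "mmlu-06"],
    "Sonnet 4.6": ["gsm8k-05", "mmlu-02", "mmlu-06", "reason-07"],
    "Haiku 4.5": ["gsm8k-04", "mmlu-04", "reason-02", "reason-05"],
}


def breakdown(misses):
    """Per-category accuracy in whole percent, from a list of missed task IDs."""
    wrong = {category: 0 for category in CATEGORIES}
    for task_id in misses:
        prefix = task_id.rsplit("-", 1)[0]
        wrong[PREFIX_TO_CATEGORY[prefix]] += 1
    return {c: round(100 * (n - wrong[c]) / n) for c, n in CATEGORIES.items()}


def overall(misses):
    """Overall accuracy in whole percent across all tasks."""
    total = sum(CATEGORIES.values())
    return round(100 * (total - len(misses)) / total)


for model, misses in MISSES.items():
    print(model, breakdown(misses), f"overall {overall(misses)}%")
```

Sonnet 4.6 and Haiku 4.5 both finish at 81% overall, yet they miss different things: Sonnet's weak spot here is knowledge, Haiku's is reasoning.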

See it check itself


The math, multiple-choice and retrieval scores below are recomputed from each stored model output by the same verifier CI runs — right now, in your browser. If a committed number ever disagreed with its raw output, its badge would read drift and the build would fail. Judge scores are shown as judged and are not recomputed.
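
A recheck of this kind might look like the sketch below. It is not the site's verifier: the record fields (`kind`, `gold`, `output`, `committed_correct`) and the function names are hypothetical, and it assumes a badge compares the stored verdict with a freshly recomputed one.

```python
import re


def score_gsm8k(output, gold):
    """Take the number after the last '####' marker and compare it to gold."""
    matches = re.findall(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", output)
    if not matches:
        return None, False
    answer = matches[-1].replace(",", "")
    return answer, float(answer) == float(gold)


def score_choice(output, gold):
    """Take the first standalone letter A-D and compare it to gold."""
    match = re.search(r"\b([A-D])\b", output.strip())
    if not match:
        return None, False
    return match.group(1), match.group(1) == gold


def recheck(record):
    """Recompute the verdict from raw output; flag drift if it disagrees."""
    scorer = score_gsm8k if record["kind"] == "gsm8k" else score_choice
    answer, correct = scorer(record["output"], record["gold"])
    status = "re-derived" if correct == record["committed_correct"] else "drift"
    return status, answer


# Hypothetical stored records; the last one has a tampered committed verdict.
records = [
    {"kind": "gsm8k", "gold": "240", "output": "48 a day for 5 days.\n#### 240",
     "committed_correct": True},
    {"kind": "gsm8k", "gold": "240", "output": "48 a day for 5 days.\n#### 241",
     "committed_correct": False},
    {"kind": "mmlu", "gold": "A", "output": "B", "committed_correct": True},
]

for record in records:
    print(recheck(record))
```

Under that assumption, "re-derived" means the stored verdict was confirmed, not that the answer was right. That would explain why a wrong answer such as 241 against a gold of 240 can still carry the badge.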

Model comparison · 21 tasks
gsm8k-01 · gsm8k · gold: 240

A baker makes 12 loaves each morning and sells them for $4 each. If he sells all of them every day for 5 days, how much money does he earn? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 241 | Sonnet 4.6 ✓ re-derived · 240 | Haiku 4.5 ✓ re-derived · 240
gsm8k-02 · gsm8k · gold: 17

Sarah has 3 boxes with 8 pencils in each box. She gives away 7 pencils. How many pencils does she have left? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 17 | Sonnet 4.6 ✓ re-derived · 17 | Haiku 4.5 ✓ re-derived · 17
gsm8k-03 · gsm8k · gold: 150

A train travels 60 miles per hour for 2.5 hours. How many miles does it travel? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 150 | Sonnet 4.6 ✓ re-derived · 150 | Haiku 4.5 ✓ re-derived · 150
gsm8k-04 · gsm8k · gold: 80

Tom buys 4 shirts at $15 each and a pair of shoes for $40. He has a $20 coupon. How much does he pay in dollars? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 80 | Sonnet 4.6 ✓ re-derived · 80 | Haiku 4.5 ✓ re-derived · 81
gsm8k-05 · gsm8k · gold: 27

A classroom has 5 rows of 6 desks. If 3 desks are broken and removed, how many desks remain? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 27 | Sonnet 4.6 ✓ re-derived · 28 | Haiku 4.5 ✓ re-derived · 27
gsm8k-06 · gsm8k · gold: 105

A farmer has 5 hens and each hen lays 3 eggs per day. How many eggs does he collect over 7 days? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 105 | Sonnet 4.6 ✓ re-derived · 105 | Haiku 4.5 ✓ re-derived · 105
gsm8k-07 · gsm8k · gold: 8

A book has 240 pages. Maria reads 30 pages each day. How many days does it take her to finish? End with "#### " and the number.

Opus 4.8 ✓ re-derived · 8 | Sonnet 4.6 ✓ re-derived · 8 | Haiku 4.5 ✓ re-derived · 8
mmlu-01 · mmlu · gold: C

What is the capital city of Australia? Answer with the letter only.

Opus 4.8 ✓ re-derived · C | Sonnet 4.6 ✓ re-derived · C | Haiku 4.5 ✓ re-derived · C
mmlu-02 · mmlu · gold: A

What is the chemical symbol for gold? Answer with the letter only.

Opus 4.8 ✓ re-derived · A | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · A
mmlu-03 · mmlu · gold: B

Which is the largest planet in our solar system? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · B
mmlu-04 · mmlu · gold: C

Who wrote the novel "Pride and Prejudice"? Answer with the letter only.

Opus 4.8 ✓ re-derived · C | Sonnet 4.6 ✓ re-derived · C | Haiku 4.5 ✓ re-derived · D
mmlu-05 · mmlu · gold: D

Approximately how fast does light travel in a vacuum? Answer with the letter only.

Opus 4.8 ✓ re-derived · D | Sonnet 4.6 ✓ re-derived · D | Haiku 4.5 ✓ re-derived · D
mmlu-06 · mmlu · gold: A

Which human organ is primarily responsible for pumping blood through the body? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · A
mmlu-07 · mmlu · gold: B

In which year did the United States declare independence? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · B
reason-01 · reasoning · gold: B

All blorgs are flurgs. Some flurgs are green. Which statement must be true? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · B
reason-02 · reasoning · gold: A

A is taller than B. C is shorter than B. Who is the tallest? Answer with the letter only.

Opus 4.8 ✓ re-derived · A | Sonnet 4.6 ✓ re-derived · A | Haiku 4.5 ✓ re-derived · B
reason-03 · reasoning · gold: C

What number comes next in the sequence 2, 4, 8, 16, ? Answer with the letter only.

Opus 4.8 ✓ re-derived · C | Sonnet 4.6 ✓ re-derived · C | Haiku 4.5 ✓ re-derived · C
reason-04 · reasoning · gold: B

If today is Wednesday, what day of the week is it three days from now? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · B
reason-05 · reasoning · gold: B

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · C
reason-06 · reasoning · gold: B

All roses are flowers. Some flowers fade quickly. Which statement must be true? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · B | Haiku 4.5 ✓ re-derived · B
reason-07 · reasoning · gold: B

What number comes next in the sequence 1, 1, 2, 3, 5, ? Answer with the letter only.

Opus 4.8 ✓ re-derived · B | Sonnet 4.6 ✓ re-derived · C | Haiku 4.5 ✓ re-derived · B

The other tests

Classic right-or-wrong tests

Standard public benchmarks

Can the model handle school-test questions — math word problems, general knowledge, simple logic — where there is exactly one right answer we can check automatically?

When there is no single right answer

This site's own LLM features

How well the AI features on this very site behave — answering with real sources, refusing the things it should, and writing decent summaries.

Finding the right page first

Embedding & retrieval quality

Before a model can answer, it has to find the right document. This checks whether the search step actually surfaces the page you meant.
