Classic right-or-wrong tests · 21 tasks
Standard public benchmarks
Can the model handle school-test questions — math word problems, general knowledge, simple logic — where there is exactly one right answer we can check automatically?
Why it matters
These are the tests every AI leaderboard is built on, and the easiest place to start: the answer is either right or wrong, no opinion required.
Sample data. These scores are from a zero-cost practice run with stand-in models — real, live numbers will replace them. The way the page works doesn't change.
The short version
Opus 4.8 is the most accurate at 90%. But Haiku 4.5 gets 81% for about 15× less money — often the smarter pick.
3 models · 21 graded answers · last run 2026-06-27.
New to this? Plain-words glossary
- Benchmark
- A fixed set of questions you give to different AI models so you can compare them fairly — the same test for everyone.
- Accuracy
- The share of questions the model got right. 90% means it answered 9 out of 10 correctly.
- Latency
- How long the model took to answer, in milliseconds. Lower is faster.
- Tokens & cost
- Models read and write in “tokens” (chunks of words). You pay per token, so more text means more money — that’s the cost column.
- GSM8K
- A famous set of grade-school math word problems used to test step-by-step reasoning.
- MMLU
- A broad multiple-choice exam spanning dozens of subjects, from history to physics — a standard knowledge test.
- LLM-as-judge
- When there’s no single right answer, a second, strong AI grades the first one’s response against a rubric.
- Embedding
- A way of turning text into a list of numbers so a computer can measure how similar two pieces of writing are — the engine behind semantic search.
- Recall@k
- Out of the documents that were actually relevant, how many showed up in the top k search results. Higher means the search found what you wanted.
- Temperature
- A dial for randomness. We set it to 0 so the model answers as consistently as possible, making the test repeatable.
For the curious
How this one works
How it runs
Every model is asked the same question with a terse, format-pinning system prompt and temperature 0. The raw text answer is stored verbatim.
How it's scored
Deterministic. The shared verifier extracts the answer (the GSM8K “#### n” convention for math, an A–F letter for multiple choice) and exact-matches it to the gold value — re-derived live in your browser and again on every commit.
Is the expensive one worth it?
Each dot is a model — accuracy up the side, cost along the bottom. Up and to the left is the sweet spot.
The full numbers
| # | Model | Accuracy | Speed | Cost | Right answers / $ |
|---|---|---|---|---|---|
| 1 | Opus 4.8 claude-opus-4-8 | 90% | 2068ms | $0.0760 | 249.9 |
| 2 | Haiku 4.5 claude-haiku-4-5-20251001 | 81% | 488ms | $0.0051 | 3353.6 |
| 3 | Sonnet 4.6 claude-sonnet-4-6 | 81% | 1019ms | $0.0152 | 1117.9 |
Where the points come from
The same scores, sliced by the kind of question — a strong overall number can hide a weak spot.
| Model | Math (GSM8K) | Knowledge (MMLU) | Reasoning |
|---|---|---|---|
| Opus 4.8 | 86% | 86% | 100% |
| Sonnet 4.6 | 86% | 71% | 86% |
| Haiku 4.5 | 86% | 86% | 71% |
See it check itself
rechecking…The math, multiple-choice and retrieval scores below are recomputed from each stored model output by the same verifier CI runs — right now, in your browser. If a committed number ever disagreed with its raw output, its badge would read drift and the build would fail. Judge scores are shown as judged and are not recomputed.
Standard public benchmarks21 tasks
A baker makes 12 loaves each morning and sells them for $4 each. If he sells all of them every day for 5 days, how much money does he earn? End with "#### " and the number.
Sarah has 3 boxes with 8 pencils in each box. She gives away 7 pencils. How many pencils does she have left? End with "#### " and the number.
A train travels 60 miles per hour for 2.5 hours. How many miles does it travel? End with "#### " and the number.
Tom buys 4 shirts at $15 each and a pair of shoes for $40. He has a $20 coupon. How much does he pay in dollars? End with "#### " and the number.
A classroom has 5 rows of 6 desks. If 3 desks are broken and removed, how many desks remain? End with "#### " and the number.
A farmer has 5 hens and each hen lays 3 eggs per day. How many eggs does he collect over 7 days? End with "#### " and the number.
A book has 240 pages. Maria reads 30 pages each day. How many days does it take her to finish? End with "#### " and the number.
What is the capital city of Australia? Answer with the letter only.
What is the chemical symbol for gold? Answer with the letter only.
Which is the largest planet in our solar system? Answer with the letter only.
Who wrote the novel "Pride and Prejudice"? Answer with the letter only.
Approximately how fast does light travel in a vacuum? Answer with the letter only.
Which human organ is primarily responsible for pumping blood through the body? Answer with the letter only.
In which year did the United States declare independence? Answer with the letter only.
All blorgs are flurgs. Some flurgs are green. Which statement must be true? Answer with the letter only.
A is taller than B. C is shorter than B. Who is the tallest? Answer with the letter only.
What number comes next in the sequence 2, 4, 8, 16, ? Answer with the letter only.
If today is Wednesday, what day of the week is it three days from now? Answer with the letter only.
A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Answer with the letter only.
All roses are flowers. Some flowers fade quickly. Which statement must be true? Answer with the letter only.
What number comes next in the sequence 1, 1, 2, 3, 5, ? Answer with the letter only.
The other tests
Smart vs fast vs cheap
Model comparisonThe same test given to three sizes of Claude at once, so you can see exactly what you give up — and save — when you pick a smaller, faster, cheaper model.
Explore →When there is no single right answer
This site's own LLM featuresHow well the AI features on this very site behave — answering with real sources, refusing the things it should, and writing decent summaries.
Explore →Finding the right page first
Embedding & retrieval qualityBefore a model can answer, it has to find the right document. This checks whether the search step actually surfaces the page you meant.
Explore →Watch the lab grow
An occasional note when a new test lands, a new model joins, or the numbers go live. No schedule, no filler.