Finding the right page first · 15 tasks
Embedding & retrieval quality
Before a model can answer, it has to find the right document. This checks whether the search step actually surfaces the page you meant.
Why it matters
Most useful AI today is search plus a model. If the search is bad, the answer is bad — so this measures the half people forget to test.
Sample data. These scores are from a zero-cost practice run with stand-in models — real, live numbers will replace them. The way the page works doesn't change.
The short version
voyage-3 scored 93% on this set.
15 graded answers · last run 2026-06-27.
New to this? Plain-words glossary
- Benchmark
- A fixed set of questions you give to different AI models so you can compare them fairly — the same test for everyone.
- Accuracy
- The share of questions the model got right. 90% means it answered 9 out of 10 correctly.
- Latency
- How long the model took to answer, in milliseconds. Lower is faster.
- Tokens & cost
- Models read and write in “tokens” (chunks of words). You pay per token, so more text means more money — that’s the cost column.
- GSM8K
- A famous set of grade-school math word problems used to test step-by-step reasoning.
- MMLU
- A broad multiple-choice exam spanning dozens of subjects, from history to physics — a standard knowledge test.
- LLM-as-judge
- When there’s no single right answer, a second, strong AI grades the first one’s response against a rubric.
- Embedding
- A way of turning text into a list of numbers so a computer can measure how similar two pieces of writing are — the engine behind semantic search.
- Recall@k
- Out of the documents that were actually relevant, the share that showed up in the top k search results. Higher means the search found more of what you wanted.
- Temperature
- A dial for randomness. We set it to 0 so the model answers as consistently as possible, making the test repeatable.
For the curious
How this one works
How it runs
The runner embeds every corpus document and every query once with Voyage (voyage-3), ranks documents by cosine similarity, and stores the ranked id list per query as the upstream artifact.
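The ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the three-number vectors are made-up stand-ins for real voyage-3 embeddings, and the names `cosine_similarity`, `rank_documents`, and the document ids are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    # Highest similarity first; ties broken by id so the order is repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Toy vectors standing in for embeddings (invented for illustration only).
doc_vecs = {
    "budget-ledger": [0.9, 0.1, 0.0],
    "forecast-dashboard": [0.1, 0.9, 0.1],
    "webcomic": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

The ranked id list is the only thing that has to be kept per query; the vectors themselves are not needed again to score the run.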
How it's scored
Deterministic. recall@k is re-derived from the stored ranking and the labeled relevant ids — no live Voyage call is needed to re-verify, so the number checks itself in the browser and in CI.
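As a rough sketch of that re-derivation, assuming the stored artifact maps each query id to its ranked document ids and the labels map each query id to its relevant ids. The page does not state which k it uses or how the files are laid out, so `k`, the ids, and the function names below are all hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant ids that appear in the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant id")
    found = relevant & set(ranked_ids[:k])
    return len(found) / len(relevant)


def mean_recall_at_k(stored_rankings, labels, k):
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(stored_rankings[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Hypothetical stored artifact: query id -> ranked document ids.
stored_rankings = {
    "q1": ["budget-ledger", "forecast-dashboard", "webcomic"],
    "q2": ["webcomic", "budget-ledger", "forecast-dashboard"],
    "q3": ["forecast-dashboard", "webcomic", "budget-ledger"],
}
# Hypothetical labels: query id -> ids a person marked as relevant.
labels = {
    "q1": ["budget-ledger"],
    "q2": ["webcomic"],
    "q3": ["budget-ledger"],
}

score_at_1 = mean_recall_at_k(stored_rankings, labels, k=1)
score_at_3 = mean_recall_at_k(stored_rankings, labels, k=3)
print(f"recall@1 = {score_at_1:.2f}, recall@3 = {score_at_3:.2f}")
# prints: recall@1 = 0.67, recall@3 = 1.00
```

Because the inputs are two plain lookups and the arithmetic has no randomness, anyone holding the same files should arrive at the same number.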
The full numbers
| # | Model | Accuracy | Speed | Cost | Right answers / $ |
|---|---|---|---|---|---|
| 1 | voyage-3 | 93% | 0ms | $0.0000 | — |
See it check itself
The math, multiple-choice and retrieval scores below are recomputed from each stored model output by the same verifier CI runs, right now, in your browser. If a committed number ever disagreed with its raw output, its badge would read drift and the build would fail. Judge scores are shown as judged and are not recomputed.
Embedding & retrieval quality · 15 tasks
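One way such a drift check could look, as a sketch under assumptions: the only badge text the page names is "drift", so the "ok" label, the tolerance, the entry names, and the scores below are invented for illustration.

```python
def badge_for(committed, recomputed, tolerance=1e-9):
    """Return 'drift' when the committed score no longer matches the recomputed one."""
    return "drift" if abs(committed - recomputed) > tolerance else "ok"


def find_drift(entries):
    """Names of every entry whose committed and recomputed scores disagree."""
    return sorted(
        name
        for name, (committed, recomputed) in entries.items()
        if badge_for(committed, recomputed) == "drift"
    )


# Made-up entries: name -> (committed score, score recomputed from raw output).
entries = {
    "retrieval": (0.75, 0.75),
    "math": (0.90, 0.80),
}

drifted = find_drift(entries)
# In CI, a non-zero exit code is what would fail the build.
exit_code = 1 if drifted else 0
print(drifted, exit_code)
```

Running the same function in the browser and in CI is what keeps the two from disagreeing about what counts as a match.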
conversation memory engine that tracks goals across turns
old public domain book anthology made searchable
catalog of reusable systems and pipelines with usage counts
forecast dashboard for a family
sub-agents that draft blog metadata and cross-link content
page that re-checks its own claims live and in CI
listen to the site read aloud as a continuous stream
daily webcomic about a developer and AI agents
exam study tool with role permissions and a confirmation gate
walkable simulated town built from civic rules
interview show that is really a networking tool
household budget ledger tracking spending in cents
page that runs self-checking model evaluations
diagram of how the parts of the site connect together
cited local government ordinance corpus you can question
The other tests
Classic right-or-wrong tests
Standard public benchmarks
Can the model handle school-test questions (math word problems, general knowledge, simple logic) where there is exactly one right answer we can check automatically?
Smart vs fast vs cheap
Model comparison
The same test given to three sizes of Claude at once, so you can see exactly what you give up, and save, when you pick a smaller, faster, cheaper model.
When there is no single right answer
This site's own LLM features
How well the AI features on this very site behave: answering with real sources, refusing the things it should, and writing decent summaries.
Watch the lab grow
An occasional note when a new test lands, a new model joins, or the numbers go live. No schedule, no filler.