
Finding the right page first · 15 tasks

Embedding & retrieval quality

Before a model can answer, it has to find the right document. This checks whether the search step actually surfaces the page you meant.

Why it matters

Most useful AI today is search plus a model. If the search is bad, the answer is bad — so this measures the half people forget to test.

Sample data. These scores are from a zero-cost practice run with stand-in models — real, live numbers will replace them. The way the page works doesn't change.

voyage-3 · 0ms · $0.0000 · 93%

The short version

voyage-3 scored 93% on this set.

15 graded answers · last run 2026-06-27.

New to this? Plain-words glossary
Benchmark
A fixed set of questions you give to different AI models so you can compare them fairly — the same test for everyone.
Accuracy
The share of questions the model got right. 90% means it answered 9 out of 10 correctly.
Latency
How long the model took to answer, in milliseconds. Lower is faster.
Tokens & cost
Models read and write in “tokens” (chunks of words). You pay per token, so more text means more money — that’s the cost column.
GSM8K
A famous set of grade-school math word problems used to test step-by-step reasoning.
MMLU
A broad multiple-choice exam spanning dozens of subjects, from history to physics — a standard knowledge test.
LLM-as-judge
When there’s no single right answer, a second, strong AI grades the first one’s response against a rubric.
Embedding
A way of turning text into a list of numbers so a computer can measure how similar two pieces of writing are — the engine behind semantic search.
Recall@k
Out of the documents that were actually relevant, the share that showed up in the top k search results. Higher means the search found what you wanted.
Temperature
A dial for randomness. We set it to 0 so the model answers as consistently as possible, making the test repeatable.

For the curious

How this one works

How it runs

The runner embeds every corpus document and every query once with Voyage (voyage-3), ranks documents by cosine similarity, and stores the ranked id list per query as the upstream artifact.
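The ranking step described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the toy three-dimensional vectors stand in for real voyage-3 embeddings, and the names `cosine`, `rank_documents`, and the `doc-*` ids are hypothetical. It assumes no vector is all zeros.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    # Sort by descending similarity; break ties by id so the order is repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

# This ranked id list is what would be stored per query.
ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

Storing only the ranked id list keeps the artifact small and means scoring never needs the vectors again.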

How it's scored

Deterministic. recall@k is re-derived from the stored ranking and the labeled relevant ids — no live Voyage call is needed to re-verify, so the number checks itself in the browser and in CI.
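As a rough sketch of that scoring step, recall@k needs only the stored ranking and the labeled relevant ids. The function name `recall_at_k` and the example ids are assumptions made for illustration; the site's verifier may be organised differently.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant id is required")
    found = set(ranked_ids[:k]) & relevant
    return len(found) / len(relevant)

# Hypothetical stored ranking for one query, best match first.
stored_ranking = ["doc-b", "doc-a", "doc-c", "doc-d"]

hit = recall_at_k(stored_ranking, ["doc-a"], k=3)   # relevant doc is in the top 3
miss = recall_at_k(stored_ranking, ["doc-d"], k=3)  # relevant doc is ranked 4th
print(hit, miss)
```

Because the function reads nothing but stored data, the same inputs always give the same score, which is what lets the number be re-verified without a live embedding call.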

The full numbers

# · Model · Accuracy · Speed · Cost · Right answers / $
1 · voyage-3 · 93% · 0ms · $0.0000

See it check itself


The math, multiple-choice and retrieval scores below are recomputed from each stored model output by the same verifier CI runs — right now, in your browser. If a committed number ever disagreed with its raw output, its badge would read drift and the build would fail. Judge scores are shown as judged and are not recomputed.
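One way such a drift check could look for the retrieval scores is sketched below. The record layout (`id`, `ranking`, `relevant`, `committed`), the function names, and the tolerance are all assumptions for illustration, not the site's actual verifier.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k ranked ids."""
    relevant = set(relevant_ids)
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def badge_for(record, k=3, tolerance=1e-9):
    """Return 're-derived' when the committed score matches the raw output, else 'drift'."""
    recomputed = recall_at_k(record["ranking"], record["relevant"], k)
    matches = abs(recomputed - record["committed"]) <= tolerance
    return "re-derived" if matches else "drift"

def verify_all(records):
    """Return the ids of drifting records; a CI job could fail when this is non-empty."""
    return [r["id"] for r in records if badge_for(r) == "drift"]

# Hypothetical records: the second one commits a score its raw output does not support.
records = [
    {"id": "ex-1", "ranking": ["d2", "d7", "d1"], "relevant": ["d7"], "committed": 1.0},
    {"id": "ex-2", "ranking": ["d4", "d5", "d6"], "relevant": ["d9"], "committed": 1.0},
]
drifting = verify_all(records)
print(drifting)
```

Running the same function in the browser and in CI means a committed number can only pass if its raw output actually supports it.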

Embedding & retrieval quality · 15 tasks
rq-01 · site

conversation memory engine that tracks goals across turns

voyage-3 · ✓ re-derived · recall@3 0%
rq-02 · site

old public domain book anthology made searchable

voyage-3 · ✓ re-derived · recall@3 100%
rq-03 · site

catalog of reusable systems and pipelines with usage counts

voyage-3 · ✓ re-derived · recall@3 100%
rq-04 · site

forecast dashboard for a family

voyage-3 · ✓ re-derived · recall@3 100%
rq-05 · site

sub-agents that draft blog metadata and cross-link content

voyage-3 · ✓ re-derived · recall@3 100%
rq-06 · site

page that re-checks its own claims live and in CI

voyage-3 · ✓ re-derived · recall@3 100%
rq-07 · site

listen to the site read aloud as a continuous stream

voyage-3 · ✓ re-derived · recall@3 100%
rq-08 · site

daily webcomic about a developer and AI agents

voyage-3 · ✓ re-derived · recall@3 100%
rq-09 · site

exam study tool with role permissions and a confirmation gate

voyage-3 · ✓ re-derived · recall@3 100%
rq-10 · site

walkable simulated town built from civic rules

voyage-3 · ✓ re-derived · recall@3 100%
rq-11 · site

interview show that is really a networking tool

voyage-3 · ✓ re-derived · recall@3 100%
rq-12 · site

household budget ledger tracking spending in cents

voyage-3 · ✓ re-derived · recall@3 100%
rq-13 · site

page that runs self-checking model evaluations

voyage-3 · ✓ re-derived · recall@3 100%
rq-14 · site

diagram of how the parts of the site connect together

voyage-3 · ✓ re-derived · recall@3 100%
rq-15 · site

cited local government ordinance corpus you can question

voyage-3 · ✓ re-derived · recall@3 100%
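The 93% headline is consistent with a plain average of the fifteen per-task recall@3 values listed above (one task at 0%, fourteen at 100%). The sketch below assumes the headline is that unweighted mean; the page does not state the aggregation rule, so treat this as a plausibility check rather than the site's method.

```python
# Per-task recall@3 as listed above: rq-01 scored 0%, rq-02 through rq-15 scored 100%.
per_task = [0.0] + [1.0] * 14

# Assumption: the headline accuracy is the unweighted mean over tasks.
mean_recall = sum(per_task) / len(per_task)  # 14 / 15, about 0.933
headline = f"{mean_recall:.0%}"
print(len(per_task), headline)
```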

The other tests

Classic right-or-wrong tests

Standard public benchmarks

Can the model handle school-test questions — math word problems, general knowledge, simple logic — where there is exactly one right answer we can check automatically?


Smart vs fast vs cheap

Model comparison

The same test given to three sizes of Claude at once, so you can see exactly what you give up — and save — when you pick a smaller, faster, cheaper model.


When there is no single right answer

This site's own LLM features

How well the AI features on this very site behave — answering with real sources, refusing the things it should, and writing decent summaries.

