Finding the right page first · 15 tasks

Embedding & retrieval quality

Before a model can answer, it has to find the right document. This checks whether the search step actually surfaces the page you meant.

Why it matters

Most useful AI today is search plus a model. If the search is bad, the answer is bad — so this measures the half people forget to test.

Sample data. These scores are from a zero-cost practice run with stand-in models — real, live numbers will replace them. The way the page works doesn't change.

The short version

voyage-3 scored 93% on this set.

15 graded answers · last run 2026-06-27.

New to this? Plain-words glossary

Benchmark: A fixed set of questions you give to different AI models so you can compare them fairly — the same test for everyone.
Accuracy: The share of questions the model got right. 90% means it answered 9 out of 10 correctly.
Latency: How long the model took to answer, in milliseconds. Lower is faster.
Tokens & cost: Models read and write in “tokens” (chunks of words). You pay per token, so more text means more money — that’s the cost column.
GSM8K: A famous set of grade-school math word problems used to test step-by-step reasoning.
MMLU: A broad multiple-choice exam spanning dozens of subjects, from history to physics — a standard knowledge test.
LLM-as-judge: When there’s no single right answer, a second, strong AI grades the first one’s response against a rubric.
Embedding: A way of turning text into a list of numbers so a computer can measure how similar two pieces of writing are — the engine behind semantic search.
Recall@k: Out of the documents that were actually relevant, the share that showed up in the top k search results. Higher means the search found more of what you wanted.
Temperature: A dial for randomness. We set it to 0 so the model answers as consistently as possible, making the test repeatable.

For the curious

How this one works

How it runs

The runner embeds every corpus document and every query once with Voyage (voyage-3), ranks documents by cosine similarity, and stores the ranked id list per query as the upstream artifact.
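The ranking step can be sketched roughly as below. This is a minimal illustration, not the runner's actual code: it assumes the embeddings already exist as plain lists of floats, makes no Voyage API call, and the names `cosine_similarity` and `rank_documents`, the document ids, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    # Highest score first; ties broken by id so the ranking is repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Toy 3-dimensional "embeddings", made up for illustration only.
docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
ranking = rank_documents([0.9, 0.1, 0.0], docs)
print(ranking)  # ['doc-a', 'doc-b', 'doc-c']
```

The list printed at the end plays the role of the stored ranked id list: once it is saved, scoring needs no further embedding calls.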

How it's scored

Deterministic. recall@k is re-derived from the stored ranking and the labeled relevant ids — no live Voyage call is needed to re-verify, so the number checks itself in the browser and in CI.
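A rough sketch of that re-derivation, assuming the stored artifact is a mapping from query id to a ranked list of document ids and the labels are a mapping from query id to relevant ids. The function name `recall_at_k`, the ids, and the data below are hypothetical, and averaging across queries is an assumption about how a set-level score might be formed.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant ids that appear in the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without at least one relevant id")
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


# Hypothetical stored rankings and labels (not the page's real data).
stored_rankings = {
    "q-1": ["d3", "d1", "d7", "d2"],
    "q-2": ["d5", "d4", "d9", "d8"],
}
relevant_labels = {
    "q-1": ["d1"],
    "q-2": ["d8", "d5"],
}

per_query = {
    qid: recall_at_k(stored_rankings[qid], relevant_labels[qid], k=3)
    for qid in stored_rankings
}
mean_recall = sum(per_query.values()) / len(per_query)
print(per_query)    # {'q-1': 1.0, 'q-2': 0.5}
print(mean_recall)  # 0.75
```

Because the function reads only the stored ranking and the labels, the same inputs always give the same score, whether it runs in a browser or in CI.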

The full numbers

#	Model	Accuracy	Speed	Cost	Right answers / $
1	voyage-3	93%	0ms	$0.0000	—

See it check itself


The math, multiple-choice and retrieval scores below are recomputed from each stored model output by the same verifier CI runs — right now, in your browser. If a committed number ever disagreed with its raw output, its badge would read drift and the build would fail. Judge scores are shown as judged and are not recomputed.
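The drift check described above might look something like this. It is only a sketch: the row layout, the names `badge_for` and `find_drift`, and the tolerance are assumptions, and the labels simply reuse the page's own badge wording ("re-derived" and "drift").

```python
import math


def badge_for(committed, recomputed, tolerance=1e-9):
    """Label a score 're-derived' when it matches its recomputation, else 'drift'."""
    if math.isclose(committed, recomputed, abs_tol=tolerance):
        return "re-derived"
    return "drift"


def find_drift(rows):
    """Return the task ids whose committed score disagrees with the recomputed one."""
    return [row["task"] for row in rows
            if badge_for(row["committed"], row["recomputed"]) == "drift"]


# Hypothetical rows: the committed number next to the freshly recomputed one.
rows = [
    {"task": "task-1", "committed": 1.0, "recomputed": 1.0},
    {"task": "task-2", "committed": 0.5, "recomputed": 0.0},
]
drifted = find_drift(rows)
print(drifted)  # ['task-2']
```

In a CI job, a non-empty `drifted` list would be the signal to exit with a non-zero status so the build fails.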

Embedding & retrieval quality · 15 tasks

rq-01 · site

conversation memory engine that tracks goals across turns

voyage-3 · ✓ re-derived · recall@3 0%

rq-02 · site

old public domain book anthology made searchable