RESEARCH PAPER

LLM-QP

Query Planning for Large Language Model Inference

Jake LawrenceConstrained DecodingContextual BanditsMLIR / StableHLO
THE IDEA

What If Your LLM Could Choose How to Think?

A paper about making constrained decoding faster by being smarter about which computation to skip.

When you ask an LLM to output JSON, or follow a grammar, or pick from a list of valid options, the model has to do something called constrained decoding. At every step, it scores every possible next token against the rules — even the ones that obviously don’t fit.

Imagine a classroom of 100,000 students. The teacher asks a question, and every single student raises their hand before the teacher checks who actually has the right answer. That’s dense decoding — thorough but wasteful.

LLM-QP asks: what if the teacher already knew which students might have the answer? Instead of checking all 100,000 hands, only check the 50 students who are actually eligible. And if the top student is way ahead of everyone else, maybe don’t even re-ask the question next time — just call on them again.

HOW IT WORKS

Three Tricks, One Framework

1Sparse Scoring

Instead of scoring all 100,000+ tokens in the vocabulary, only score the ones that are actually valid right now. If the grammar says only 50 tokens are legal at this step, why compute scores for the other 99,950? The paper proves these give identical outputs — you get the same answer either way.

2Amortized Queries

If the model is very confident in its answer (the top choice is way ahead of the second place), maybe we don’t need to re-run the full transformer for the next token. The model’s internal state hasn’t changed much, so the answer is probably the same. Think of it like not re-reading the whole textbook for every quiz question when you already know the chapter.

3Bandit Routing

How do you decide which trick to use at each step? LLM-QP frames this as a contextual bandit problem — basically, a traffic controller that learns from experience. It looks at the current situation (how many tokens are valid, how confident the model is, how deep into the sequence we are) and picks the fastest execution plan that won’t sacrifice quality. Over time, it gets better at routing.

+Compiler Integration

All of this plugs into existing ML compiler infrastructure (MLIR / StableHLO). The compiler generates multiple equivalent execution plans and selects the cheapest one at compile time or runtime. No new hardware required — it’s a software optimization layer on top of what already exists.

TEST IT LIVE

The Figures, Recomputed in Your Browser

The paper’s three benchmark figures were static images. Here they are again, computed live from the model’s own equations — drag the parameters and watch each claim hold or break. The contextual-bandit router is a real LinUCB instance learning in front of you.

End-to-end architecture: a constrained-decoding query becomes a constraint graph, the planner routes to the cheapest equivalent execution plan, and telemetry feeds back.

End-to-end architecture: query in, constraint graph defines valid tokens, the planner routes to the cheapest equivalent plan, telemetry feeds back.

Interactive · Claim 1

Sparse scoring wins while K < |V|

Constrained decoding scores only the valid tokens. The dense head costs 2d|V| no matter what; the sparse head costs 2dK, where K is the number of legal tokens right now. Move the model dimension and vocabulary and watch the crossover sit exactly at K = |V|.

Dense head (2d|V|)
Sparse head (2dK)
0.000.170.340.510.682101001.0k10k100k128kK = |V|K (valid tokens, log scale)latency (ms)
0.628 ms
dense step
0.051 ms
sparse @ K=256
12x
speedup @ K=256
K = 128k
crossover
Interactive · Claim 2

Amortize when the model is confident

If the top token is far ahead of the runner-up (a large margin), the next step probably has the same answer, so skip the full recompute. Refine only when the margin falls below the threshold τ. Drag τ: at τ = 0 the model trusts every step (cheapest, riskiest); at τ = 1 it refines always and collapses onto full recompute.

Full recompute every step
Amortized (refine when unsure)
04387130174165129192256decode step tcumulative latency (ms)
35%
refine rate
2.4x
speedup vs full
94.5 ms saved
over 256 steps
Interactive · Claim 3 · the live model

The router learns to route

A real LinUCB contextual bandit (Appendix A, running in your browser) sees each step’s context — how many tokens are valid, how confident the model is — and picks a plan. It is paid the true cost and updates. Watch cumulative regret against the hindsight oracle bend over and flatten: that bend is the convergence the paper claims, and nobody told the router the answer.

1,200 / 1,200 steps
Cumulative regret (lower is better; flat = converged)
0.00.71.42.12.913016019001200decode step tregret (ms)
Average cost per step (oracle = 1.00)
Oracle (hindsight)
1.000
LinUCB router
1.046
Static sparse
1.060
Static dense
6.660
1%
Dense chosen
92%
Sparse chosen
8%
Amortized chosen
2.6 ms
final regret

Deterministic given the seed. The router converges to within a few percent of the unbeatable oracle and leaves both fixed policies behind — and it learns that the conservative dense head is dominated, so it almost never picks it.

The original published run lives in papers/LLM-QP/*.csv; the live model agrees with it in shape, and that agreement is asserted in CI (model.test.js).

WHY IT MATTERS

Faster Structured Output, Same Quality

Every time you ask an LLM to output JSON, fill a form, follow a schema, or generate code that compiles — that’s constrained decoding. It’s one of the most important capabilities for production AI systems, and it’s getting slower as models and vocabularies get bigger.

LLM-QP reframes this as a query planning problem — borrowed from database systems, where optimizers have spent decades learning to pick the cheapest execution plan that gives the same answer. The paper shows you can apply the same idea to LLM inference.

The result: equivalent outputs, less compute. No retraining, no new models, no hardware changes. Just smarter routing.

Read the Full Paper
All the math, proofs, pseudocode, and benchmarks
View PDFDownload PDF
References & lineage
1.Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation (LinUCB). WWW.
2.Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3.
3.Selinger, P. G., et al. (1979). Access path selection in a relational database management system (the System R optimizer). ACM SIGMOD.
4.Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4).
5.Lattner, C., et al. (2021). MLIR: scaling compiler infrastructure for domain-specific computation. CGO.
6.Willard, B. T., & Louf, R. (2023). Efficient guided generation for large language models (constrained decoding). arXiv:2307.09702.

Related

Theory
LLM-QP
Direct system-paper relationship documenting the same research
Method
The Word Machine
Both optimize large language model architectures and inference
Method
SAGEN
Both develop architectural solutions for AI system optimization
Method
Stable Match
Both use algorithmic approaches to solve complex optimization problems

Need something like this built?

I design and ship AI tools, full-stack apps, and data pipelines — end to end, to production. Tell me the problem in a sentence; I'll give you an honest read on fit within a day.

Work with me →