FULL PAPER

LLM-QP: Query Planning for Large Language Model Inference

Jake Lawrence  ·  Independent Researcher

SECTION 1

Plan Equivalence

Two inference plans P1,P2P_1, P_2 are equivalent if they produce identical token sequences under identical model and constraint semantics.

Definition (ϵ\epsilon-Equivalence)

Two plans are ϵ\epsilon-equivalent if logits differ by at most ϵ\epsilon over the valid candidate set and produce identical decoded outputs.

Lemma: Dense and Sparse Equivalence

Let A(s)\mathcal{A}(s) denote the valid token set. Dense masked scoring:

~(v)={σ(h,v)vA(s)otherwise\tilde{\ell}(v) = \begin{cases} \sigma(h,v) & v \in \mathcal{A}(s) \\ -\infty & \text{otherwise} \end{cases}

Sparse scoring:

(v)=σ(h,v),vA(s)\ell(v) = \sigma(h,v), \quad v \in \mathcal{A}(s)

Then:

argmaxvV~(v)=argmaxvA(s)(v)\arg\max_{v \in V} \tilde{\ell}(v) = \arg\max_{v \in \mathcal{A}(s)} \ell(v)
SECTION 2

Roofline Regime Analysis

Dense head compute:

Wdense2dVW_{dense} \approx 2d|V|

Sparse head compute:

Wsparse2dKW_{sparse} \approx 2dK

Sparse dominates when:

K<VK < |V|
SECTION 3

Amortized Query Hypothesis

We model decoder state as:

ht=hstable+Δhth_t = h^{stable} + \Delta h_t

Full transformer recomputation may provide limited incremental information when margins are large. Define margin:

mt=(v(1))(v(2))m_t = \ell(v^{(1)}) - \ell(v^{(2)})

Routing decision — refine if:

mt<τm_t < \tau

Otherwise use amortized scoring.

SECTION 4

Planner Routing as a Contextual Bandit

At decoding step tt, the planner observes a context vector xtXx_t \in \mathcal{X} and selects an execution plan atAa_t \in \mathcal{A}. We define a scalar loss:

t(a)=λLatencyt(a)+(1λ)QualityLosst(a)\ell_t(a) = \lambda \cdot \mathrm{Latency}_t(a) + (1-\lambda) \cdot \mathrm{QualityLoss}_t(a)

and seek to minimize t=1Tt(at)\sum_{t=1}^T \ell_t(a_t).

Oracle and Regret

Define the per-step oracle action:

at=argminaAt(a)a_t^* = \arg\min_{a \in \mathcal{A}} \ell_t(a)

and dynamic regret:

RTdyn=t=1Tt(at)t=1Tt(at)R_T^{dyn} = \sum_{t=1}^T \ell_t(a_t) - \sum_{t=1}^T \ell_t(a_t^*)

For stationary regimes, define the best fixed plan:

astat=argminaAt=1Tt(a)a^{stat} = \arg\min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t(a)

with static regret:

RTstat=t=1Tt(at)minaAt=1Tt(a)R_T^{stat} = \sum_{t=1}^T \ell_t(a_t) - \min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t(a)

Linear Contextual Bandit

Assume realizability: E[t(a)xt]=xtθa\mathbb{E}[\ell_t(a) \mid x_t] = x_t^\top \theta_a, with xt1\|x_t\| \le 1. LinUCB/linear Thompson sampling achieves sublinear regret of the form:

RTstat=O~ ⁣(dTA)R_T^{stat} = \tilde{O}\!\left(d\sqrt{T|\mathcal{A}|}\right)

where d=dim(xt)d = \dim(x_t) and O~\tilde{O} hides logarithmic factors.

Nonstationarity and Variation Budget

To capture policy churn, define a variation budget:

VT=t=2TsupaAE[t(a)xt]E[t1(a)xt1]V_T = \sum_{t=2}^T \sup_{a \in \mathcal{A}} \left| \mathbb{E}[\ell_t(a) \mid x_t] - \mathbb{E}[\ell_{t-1}(a) \mid x_{t-1}] \right|

Sliding-window or restart-based bandits can achieve dynamic regret bounds of the form:

RTdyn=O~ ⁣(TA+VT1/3T2/3)R_T^{dyn} = \tilde{O}\!\left(\sqrt{T|\mathcal{A}|} + V_T^{1/3} T^{2/3}\right)

under standard boundedness assumptions.

Connection to Margin-Based Routing

The threshold router πτ\pi_\tau from the amortization section is a finite policy class when τ\tau is discretized. This enables expert-style learning over {πτi}\{\pi_{\tau_i}\} with regret O(sqrtTlogPi)O(\\sqrt{T \\log |\\Pi|}) for Π|\Pi| candidate thresholds.

Bandit loop for LLM-QP

Bandit loop for LLM-QP. The planner maps runtime context to a plan, observes latency and quality proxies (optionally via shadow evaluation), and updates the routing policy online.

SECTION 5

Compiler Integration: Plan Selection in MLIR / StableHLO

Modern inference stacks already compile model graphs through MLIR and StableHLO. LLM-QP can therefore be implemented as a compiler pass that introduces multiple equivalent execution plans and selects among them using a cost model.

Logical vs Physical Plans

We define a logical operator: DecodeStep(query, constraint_state) which expresses the semantics of a single constrained decoding step. Physical implementations include:

  • Dense projection head
  • Sparse adjacency scoring
  • Amortized query update
  • Amortized update with rerank
  • Full recomputation (refinement)

Plan Expansion Pass

During compilation the planner expands the logical node into candidate physical plans:

DecodeStep
  -> DenseHead
  -> SparseHead
  -> AmortizedHead
  -> AmortizedHead + Rerank

Cost Model

Each plan is annotated with an estimated cost covering kernel latency estimates, memory bandwidth usage, refinement probability, and device utilization. The cost function follows:

C=λLatency+(1λ)QualityLossC = \lambda \cdot \mathrm{Latency} + (1-\lambda) \cdot \mathrm{QualityLoss}

Rewrite Rule

if cost(SparseHead) < cost(DenseHead):
    use SparseHead
else:
    use DenseHead

In practice this is implemented using MLIR pattern rewrites or XLA custom calls.

Runtime Adaptation

Runtime telemetry (margins, branching factor, cache state) feeds back into the cost model. The compiler therefore emits a small routing kernel that chooses the implementation at runtime. This hybrid compile-time / runtime planning architecture allows LLM-QP to reuse existing compiler infrastructure while enabling adaptive execution.

FIGURES

Visualizations

Amortized vs full recompute routing tradeoff

Amortized vs full recompute routing tradeoff.

Routing cost vs margin threshold

Routing cost vs margin threshold.

Synthetic margin vs branching factor

Synthetic margin vs branching factor.

APPENDIX

Planner Pseudocode

LinUCB for Plan Selection

Let A=AA = |\mathcal{A}| and feature dimension dd. Maintain per-action matrices VaRd×dV_a \in \mathbb{R}^{d \times d} and vectors baRdb_a \in \mathbb{R}^d initialized as Va=αIV_a = \alpha I, ba=0b_a = 0. At each step tt:

  1. Observe context xtx_t.
  2. For each action aa, estimate θ^a=Va1ba\hat{\theta}_a = V_a^{-1} b_a and compute:
    ucba=xtθ^a+βxtVa1xt\mathrm{ucb}_a = x_t^\top \hat{\theta}_a + \beta \sqrt{x_t^\top V_a^{-1} x_t}
  3. Choose at=argminaucbaa_t = \arg\min_a \mathrm{ucb}_a (loss-minimization).
  4. Execute plan ata_t, observe loss t(at)\ell_t(a_t).
  5. Update VatVat+xtxtV_{a_t} \leftarrow V_{a_t} + x_t x_t^\top and batbat+xtt(at)b_{a_t} \leftarrow b_{a_t} + x_t \ell_t(a_t).

Expert-Over-Thresholds Routing

Discretize thresholds {τ1,,τN}\{\tau_1, \dots, \tau_N\} and define policies πi=πτi\pi_i = \pi_{\tau_i}. Maintain weights wiw_i initialized uniformly. At each step tt:

  1. Observe context xtx_t and compute each expert’s suggested action at(i)=πi(xt)a_t^{(i)} = \pi_i(x_t).
  2. Sample an expert index ii proportional to weights (or pick argmaxwi\arg\max w_i for greedy).
  3. Execute at=at(i)a_t = a_t^{(i)}, observe loss t(at)\ell_t(a_t).
  4. Update weights using exponential weighting:
    wiwiexp(η^t,i)w_i \leftarrow w_i \exp(-\eta \hat{\ell}_{t,i})
    where ^t,i\hat{\ell}_{t,i} is an unbiased loss estimate.

This realizes regret O(TlogN)O(\sqrt{T \log N}) against the best threshold policy in hindsight under standard assumptions.

SYSTEM ARCHITECTURE

End-to-End System Architecture

End-to-end architecture of LLM-QP

End-to-end architecture of LLM-QP. A constrained decoding query is transformed into a constraint graph, analyzed by the planner, routed through the execution operator space, and refined via telemetry feedback.

The pipeline operates as follows:

  1. Query embedding produces a representation of the decoding state.
  2. The constraint graph defines valid token transitions.
  3. The planner selects an execution plan using contextual signals.
  4. The chosen operator executes the step.
  5. Telemetry feeds back to update routing decisions.
BENCHMARKS

Minimal Reproducible Benchmarks

We provide a portable benchmark harness (synthetic, deterministic) that instantiates the cost models and routing policies used throughout the paper. The suite outputs: (i) dense vs sparse crossover vs KK, (ii) amortized vs full recomputation crossover vs decode length, and (iii) router regret relative to an oracle policy (threshold router).

The harness now runs here, in the browser. Drag the parameters; the three figures below are recomputed live from the model’s own equations, and the router is a real LinUCB instance learning against the oracle.

Interactive · Claim 1

Sparse scoring wins while K < |V|

Constrained decoding scores only the valid tokens. The dense head costs 2d|V| no matter what; the sparse head costs 2dK, where K is the number of legal tokens right now. Move the model dimension and vocabulary and watch the crossover sit exactly at K = |V|.

Dense head (2d|V|)
Sparse head (2dK)
0.000.170.340.510.682101001.0k10k100k128kK = |V|K (valid tokens, log scale)latency (ms)
0.628 ms
dense step
0.051 ms
sparse @ K=256
12x
speedup @ K=256
K = 128k
crossover
Interactive · Claim 2

Amortize when the model is confident

If the top token is far ahead of the runner-up (a large margin), the next step probably has the same answer, so skip the full recompute. Refine only when the margin falls below the threshold τ. Drag τ: at τ = 0 the model trusts every step (cheapest, riskiest); at τ = 1 it refines always and collapses onto full recompute.

Full recompute every step
Amortized (refine when unsure)
04387130174165129192256decode step tcumulative latency (ms)
35%
refine rate
2.4x
speedup vs full
94.5 ms saved
over 256 steps
Interactive · Claim 3 · the live model

The router learns to route

A real LinUCB contextual bandit (Appendix A, running in your browser) sees each step’s context — how many tokens are valid, how confident the model is — and picks a plan. It is paid the true cost and updates. Watch cumulative regret against the hindsight oracle bend over and flatten: that bend is the convergence the paper claims, and nobody told the router the answer.

1,200 / 1,200 steps
Cumulative regret (lower is better; flat = converged)
0.00.71.42.12.913016019001200decode step tregret (ms)
Average cost per step (oracle = 1.00)
Oracle (hindsight)
1.000
LinUCB router
1.046
Static sparse
1.060
Static dense
6.660
1%
Dense chosen
92%
Sparse chosen
8%
Amortized chosen
2.6 ms
final regret

Deterministic given the seed. The router converges to within a few percent of the unbeatable oracle and leaves both fixed policies behind — and it learns that the conservative dense head is dominated, so it almost never picks it.

View PDFDownload PDF
← Back to ELI5 Overview

Related

Theory
LLM-QP
Direct system-paper relationship documenting the same research
Method
The Word Machine
Both optimize large language model architectures and inference
Method
SAGEN
Both develop architectural solutions for AI system optimization
Method
Stable Match
Both use algorithmic approaches to solve complex optimization problems

Need something like this built?

I design and ship AI tools, full-stack apps, and data pipelines — end to end, to production. Tell me the problem in a sentence; I'll give you an honest read on fit within a day.

Work with me →