FULL PAPER

LLM-QP: Query Planning for Large Language Model Inference

Jake Lawrence · Independent Researcher

SECTION 1

Plan Equivalence

Two inference plans $P_1, P_2$ are equivalent if they produce identical token sequences under identical model and constraint semantics.

Definition ($\epsilon$-Equivalence)

Two plans are $\epsilon$-equivalent if their logits differ by at most $\epsilon$ on every token in the valid candidate set and they produce identical decoded outputs.

Lemma: Dense and Sparse Equivalence

Let $\mathcal{A}(s)$ denote the valid token set. Dense masked scoring:

\tilde{\ell}(v) = \begin{cases} \sigma(h,v) & v \in \mathcal{A}(s) \\ -\infty & \text{otherwise} \end{cases}

Sparse scoring:

\ell(v) = \sigma(h,v), \quad v \in \mathcal{A}(s)

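Then, provided $\mathcal{A}(s)$ is nonempty and $\sigma(h,v)$ is finite on it, a masked token can never attain the maximum, so: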

\arg\max_{v \in V} \tilde{\ell}(v) = \arg\max_{v \in \mathcal{A}(s)} \ell(v)
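The lemma can be checked on a toy example. The sketch below is illustrative only: the score list stands in for $\sigma(h,v)$, the function names are hypothetical, and ties are broken toward the smallest token index in both paths (an assumption, since the text does not specify tie-breaking).

```python
import math

def dense_masked_argmax(scores, valid):
    """Score every token in V, masking tokens outside A(s) to -inf."""
    masked = [s if v in valid else -math.inf for v, s in enumerate(scores)]
    return max(range(len(masked)), key=masked.__getitem__)

def sparse_argmax(scores, valid):
    """Score only the tokens in A(s); sorted() keeps tie-breaking identical."""
    return max(sorted(valid), key=lambda v: scores[v])

scores = [0.3, 2.1, -0.7, 1.9, 2.5]  # toy sigma(h, v) over a 5-token vocabulary
valid = {0, 2, 3}                     # A(s); token 4 has the top raw score but is invalid

assert dense_masked_argmax(scores, valid) == sparse_argmax(scores, valid) == 3
```

Both paths return token 3: the two highest raw scores belong to invalid tokens, which the mask removes and the sparse path never visits.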

SECTION 2

Roofline Regime Analysis

Dense head compute per decoding step, for model dimension $d$ and vocabulary $V$:

W_{dense} \approx 2d|V|

Sparse head compute, where $K = |\mathcal{A}(s)|$ is the number of valid tokens:

W_{sparse} \approx 2dK

Sparse dominates when:

K < |V|
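As a rough worked example, the two compute estimates can be compared directly. The dimensions below are the defaults quoted in the benchmark section ($d = 4{,}096$, $|V| = 128{,}256$, $K = 256$); the function names are hypothetical. The ratio counts arithmetic only, so measured latency, which also depends on memory traffic and fixed overheads, can show a much smaller gap.

```python
def dense_head_work(d, vocab_size):
    """W_dense ~ 2 d |V|: one multiply-add per (hidden unit, token) pair."""
    return 2 * d * vocab_size

def sparse_head_work(d, k):
    """W_sparse ~ 2 d K: the same product restricted to the K valid tokens."""
    return 2 * d * k

d, vocab_size, k = 4096, 128256, 256
ratio = dense_head_work(d, vocab_size) / sparse_head_work(d, k)
print(ratio)  # 501.0, i.e. |V| / K
```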

SECTION 3

Amortized Query Hypothesis

We model decoder state as:

h_t = h^{stable} + \Delta h_t

Full transformer recomputation may provide limited incremental information when margins are large. Define the margin between the top-scoring token $v^{(1)}$ and the runner-up $v^{(2)}$:

m_t = \ell(v^{(1)}) - \ell(v^{(2)})

Routing decision — refine if:

m_t < \tau

Otherwise use amortized scoring.
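A minimal sketch of this routing rule, assuming the logits are available as a plain list with at least two entries. The names `margin` and `route` and the string labels are hypothetical.

```python
def margin(logits):
    """m_t: gap between the best and second-best logit."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2

def route(logits, tau):
    """Refine (full recomputation) when the margin is below tau, else amortize."""
    return "refine" if margin(logits) < tau else "amortized"

assert route([2.0, 1.9, 0.1], tau=0.35) == "refine"     # margin ~0.1: uncertain step
assert route([3.0, 1.0, 0.2], tau=0.35) == "amortized"  # margin 2.0: confident step
```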

SECTION 4

Planner Routing as a Contextual Bandit

At decoding step $t$, the planner observes a context vector $x_t \in \mathcal{X}$ and selects an execution plan $a_t \in \mathcal{A}$. We define a scalar loss:

\ell_t(a) = \lambda \cdot \mathrm{Latency}_t(a) + (1-\lambda) \cdot \mathrm{QualityLoss}_t(a)

and seek to minimize $\sum_{t=1}^T \ell_t(a_t)$.

Oracle and Regret

Define the per-step oracle action:

a_t^* = \arg\min_{a \in \mathcal{A}} \ell_t(a)

and dynamic regret:

R_T^{dyn} = \sum_{t=1}^T \ell_t(a_t) - \sum_{t=1}^T \ell_t(a_t^*)

For stationary regimes, define the best fixed plan:

a^{stat} = \arg\min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t(a)

with static regret:

R_T^{stat} = \sum_{t=1}^T \ell_t(a_t) - \min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t(a)
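The two regret notions can be computed side by side on a small loss table. This is an illustrative sketch: `losses[t][a]` stands in for $\ell_t(a)$, the numbers are made up, and the helper names are hypothetical.

```python
def scalar_loss(latency, quality_loss, lam):
    """l_t(a) = lam * Latency + (1 - lam) * QualityLoss."""
    return lam * latency + (1.0 - lam) * quality_loss

def dynamic_regret(losses, chosen):
    """Compare against the per-step oracle a_t^*."""
    return sum(row[a] - min(row) for row, a in zip(losses, chosen))

def static_regret(losses, chosen):
    """Compare against the best single plan in hindsight."""
    incurred = sum(row[a] for row, a in zip(losses, chosen))
    n_actions = len(losses[0])
    best_fixed = min(sum(row[a] for row in losses) for a in range(n_actions))
    return incurred - best_fixed

losses = [[1.0, 3.0], [4.0, 2.0], [1.0, 5.0]]  # toy l_t(a) for T = 3 steps, 2 plans
chosen = [0, 0, 1]                              # plans picked by some router

print(dynamic_regret(losses, chosen))  # 6.0
print(static_regret(losses, chosen))   # 4.0
```

Dynamic regret is never smaller than static regret, because the per-step oracle is at least as good as any fixed plan.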

Linear Contextual Bandit

Assume realizability: $\mathbb{E}[\ell_t(a) \mid x_t] = x_t^\top \theta_a$, with $\|x_t\| \le 1$. LinUCB or linear Thompson sampling then achieves sublinear regret of the form:

R_T^{stat} = \tilde{O}\!\left(d\sqrt{T|\mathcal{A}|}\right)

where $d = \dim(x_t)$ and $\tilde{O}$ hides logarithmic factors.

Nonstationarity and Variation Budget

To capture policy churn, define a variation budget:

V_T = \sum_{t=2}^T \sup_{a \in \mathcal{A}} \left| \mathbb{E}[\ell_t(a) \mid x_t] - \mathbb{E}[\ell_{t-1}(a) \mid x_{t-1}] \right|

Sliding-window or restart-based bandits can achieve dynamic regret bounds of the form:

R_T^{dyn} = \tilde{O}\!\left(\sqrt{T|\mathcal{A}|} + V_T^{1/3} T^{2/3}\right)

under standard boundedness assumptions.
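To make the variation budget concrete, the sketch below evaluates $V_T$ on a small table of expected losses. The table entries are made-up stand-ins for $\mathbb{E}[\ell_t(a) \mid x_t]$, and the function name is hypothetical.

```python
def variation_budget(expected_losses):
    """Sum over t >= 2 of the largest per-plan change in expected loss."""
    return sum(
        max(abs(cur - prev) for cur, prev in zip(row, prev_row))
        for prev_row, row in zip(expected_losses, expected_losses[1:])
    )

# Two plans, three steps: nothing moves at t = 2, then plan 0 jumps by 2.0 at t = 3.
expected = [[1.0, 2.0], [1.0, 2.0], [3.0, 2.5]]
print(variation_budget(expected))  # 2.0
```

A stationary environment gives $V_T = 0$, and the second term of the dynamic regret bound vanishes.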

Connection to Margin-Based Routing

Discretizing $\tau$ turns the threshold router $\pi_\tau$ from the amortization section into a finite policy class $\Pi = \{\pi_{\tau_i}\}$. This enables expert-style learning over $\Pi$ with regret $O(\sqrt{T \log |\Pi|})$ for $|\Pi|$ candidate thresholds.

Bandit loop for LLM-QP. The planner maps runtime context to a plan, observes latency and quality proxies (optionally via shadow evaluation), and updates the routing policy online.

SECTION 5

Compiler Integration: Plan Selection in MLIR / StableHLO

Modern inference stacks already compile model graphs through MLIR and StableHLO. LLM-QP can therefore be implemented as a compiler pass that introduces multiple equivalent execution plans and selects among them using a cost model.

Logical vs Physical Plans

We define a logical operator, DecodeStep(query, constraint_state), which expresses the semantics of a single constrained decoding step. Physical implementations include:

Dense projection head
Sparse adjacency scoring
Amortized query update
Amortized update with rerank
Full recomputation (refinement)

Plan Expansion Pass

During compilation the planner expands the logical node into candidate physical plans:

DecodeStep
  -> DenseHead
  -> SparseHead
  -> AmortizedHead
  -> AmortizedHead + Rerank

Cost Model

Each plan is annotated with an estimated cost covering kernel latency estimates, memory bandwidth usage, refinement probability, and device utilization. The cost function follows:

C = \lambda \cdot \mathrm{Latency} + (1-\lambda) \cdot \mathrm{QualityLoss}

Rewrite Rule

if cost(SparseHead) < cost(DenseHead):
    use SparseHead
else:
    use DenseHead

In practice this is implemented using MLIR pattern rewrites or XLA custom calls.
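The two-way rule extends naturally to all candidate plans by taking the minimum-cost one. The sketch below is plain Python rather than an MLIR pass, and every number in it is a made-up placeholder; `PlanEstimate`, `plan_cost`, and `select_plan` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanEstimate:
    name: str
    latency: float       # estimated latency (arbitrary units, placeholder values)
    quality_loss: float  # estimated quality loss (same arbitrary units)

def plan_cost(plan, lam):
    """C = lam * Latency + (1 - lam) * QualityLoss."""
    return lam * plan.latency + (1.0 - lam) * plan.quality_loss

def select_plan(candidates, lam):
    """Pick the candidate physical plan with the lowest estimated cost."""
    return min(candidates, key=lambda p: plan_cost(p, lam))

candidates = [
    PlanEstimate("DenseHead", latency=8.0, quality_loss=0.0),
    PlanEstimate("SparseHead", latency=1.0, quality_loss=0.0),
    PlanEstimate("AmortizedHead", latency=0.5, quality_loss=4.0),
    PlanEstimate("AmortizedHead + Rerank", latency=1.0, quality_loss=1.0),
]

print(select_plan(candidates, lam=0.5).name)  # SparseHead
print(select_plan(candidates, lam=1.0).name)  # AmortizedHead
```

With these placeholder estimates, a latency-only objective ($\lambda = 1$) prefers the amortized head, while a balanced objective prefers the sparse head because it is exact.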

Runtime Adaptation

Runtime telemetry (margins, branching factor, cache state) feeds back into the cost model. The compiler therefore emits a small routing kernel that chooses the implementation at runtime. This hybrid compile-time / runtime planning architecture allows LLM-QP to reuse existing compiler infrastructure while enabling adaptive execution.

FIGURES

Visualizations

Amortized vs full recompute routing tradeoff.

Routing cost vs margin threshold.

Synthetic margin vs branching factor.

APPENDIX

Planner Pseudocode

LinUCB for Plan Selection

Let $A = |\mathcal{A}|$ and feature dimension $d$. Maintain per-action matrices $V_a \in \mathbb{R}^{d \times d}$ and vectors $b_a \in \mathbb{R}^d$ initialized as $V_a = \alpha I$, $b_a = 0$. At each step $t$:

Observe context $x_t$.
For each action $a$, estimate $\hat{\theta}_a = V_a^{-1} b_a$ and compute the optimistic (lower-confidence) index:
$\mathrm{lcb}_a = x_t^\top \hat{\theta}_a - \beta \sqrt{x_t^\top V_a^{-1} x_t}$
Choose $a_t = \arg\min_a \mathrm{lcb}_a$ (loss minimization, so the exploration bonus is subtracted).
Execute plan $a_t$, observe loss $\ell_t(a_t)$.
Update $V_{a_t} \leftarrow V_{a_t} + x_t x_t^\top$ and $b_{a_t} \leftarrow b_{a_t} + x_t \ell_t(a_t)$.
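The steps above can be sketched in dependency-free Python. Because the planner minimizes loss, this sketch subtracts the exploration bonus, so the optimistic choice is the action with the smallest lower confidence bound. It stores $V_a^{-1}$ directly and updates it with the Sherman-Morrison identity, which is an implementation choice of this sketch. The class name and the two-regime toy environment are hypothetical.

```python
import math

class LinUCBPlanner:
    def __init__(self, n_actions, dim, alpha=1.0, beta=1.0):
        self.beta = beta
        # V_a = alpha * I, so V_a^{-1} = I / alpha.
        self.V_inv = [[[1.0 / alpha if i == j else 0.0 for j in range(dim)]
                       for i in range(dim)] for _ in range(n_actions)]
        self.b = [[0.0] * dim for _ in range(n_actions)]

    @staticmethod
    def _matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    def index(self, a, x):
        """Estimated loss minus the confidence width (optimism for losses)."""
        theta = self._matvec(self.V_inv[a], self.b[a])  # V_a^{-1} b_a
        Vx = self._matvec(self.V_inv[a], x)             # V_a^{-1} x
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(v * xi for v, xi in zip(Vx, x)), 0.0))
        return mean - self.beta * width

    def select(self, x):
        return min(range(len(self.b)), key=lambda a: self.index(a, x))

    def update(self, a, x, loss):
        # Sherman-Morrison: (V + x x^T)^{-1} = V^{-1} - (V^{-1}x)(V^{-1}x)^T / (1 + x^T V^{-1} x)
        Vx = self._matvec(self.V_inv[a], x)
        denom = 1.0 + sum(v * xi for v, xi in zip(Vx, x))
        self.V_inv[a] = [[m - vi * vj / denom for m, vj in zip(row, Vx)]
                         for row, vi in zip(self.V_inv[a], Vx)]
        self.b[a] = [bi + xi * loss for bi, xi in zip(self.b[a], x)]

# Toy environment (made up): two regimes with one-hot contexts; plan 0 is cheap
# in regime 0 and plan 1 is cheap in regime 1.
true_loss = {(0, 0): 0.2, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.2}
planner = LinUCBPlanner(n_actions=2, dim=2)
mistakes = 0
for t in range(400):
    regime = t % 2
    x = [1.0, 0.0] if regime == 0 else [0.0, 1.0]
    a = planner.select(x)
    mistakes += a != regime
    planner.update(a, x, true_loss[(regime, a)])

assert planner.select([1.0, 0.0]) == 0 and planner.select([0.0, 1.0]) == 1
```

In this noiseless toy setting the planner tries the expensive plan only a handful of times per regime before its lower bound rises above the cheap plan's estimate.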

Expert-Over-Thresholds Routing

Discretize thresholds $\{\tau_1, \dots, \tau_N\}$ and define policies $\pi_i = \pi_{\tau_i}$. Maintain weights $w_i$ initialized uniformly. At each step $t$:

Observe context $x_t$ and compute each expert’s suggested action $a_t^{(i)} = \pi_i(x_t)$.
Sample an expert index $i$ with probability proportional to its weight (or pick $\arg\max_i w_i$ for a greedy variant).
Execute $a_t = a_t^{(i)}$, observe loss $\ell_t(a_t)$.
Update every expert’s weight using exponential weighting:
$w_i \leftarrow w_i \exp(-\eta \hat{\ell}_{t,i})$
where $\hat{\ell}_{t,i}$ is an unbiased estimate of expert $i$’s loss and $\eta > 0$ is the learning rate.

This realizes regret $O(\sqrt{T \log N})$ against the best threshold policy in hindsight under standard assumptions.
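One way to realize this procedure is sketched below. The text leaves the unbiased estimate unspecified; this sketch assumes an Exp4-style importance-weighted estimate, where the observed loss is divided by the probability that the executed action was recommended. It also assumes the context is the scalar margin, and the class name, action labels, and toy losses are hypothetical.

```python
import math
import random

def threshold_policy(tau):
    """pi_tau: refine when the margin is below tau, otherwise amortize."""
    return lambda m: "refine" if m < tau else "amortized"

class ThresholdExperts:
    def __init__(self, thresholds, eta=0.1, seed=0):
        self.policies = [threshold_policy(t) for t in thresholds]
        self.weights = [1.0 / len(thresholds)] * len(thresholds)
        self.eta = eta
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def act(self, m):
        probs = self.probabilities()
        advice = [pi(m) for pi in self.policies]
        i = self.rng.choices(range(len(probs)), weights=probs)[0]
        action = advice[i]
        # q: probability that the executed action was recommended.
        q = sum(p for p, a in zip(probs, advice) if a == action)
        return action, advice, q

    def update(self, action, advice, q, loss):
        est = loss / q  # importance-weighted loss; experts advising other actions get 0
        self.weights = [w * math.exp(-self.eta * est) if a == action else w
                        for w, a in zip(self.weights, advice)]
        self.weights = self.probabilities()  # renormalize to avoid underflow

# Toy run (made-up losses): the margin is always 0.5 and refining is always better,
# so the tau = 1.0 expert (always refine) should not lose weight.
experts = ThresholdExperts([0.0, 1.0])
for _ in range(200):
    action, advice, q = experts.act(0.5)
    loss = 0.0 if action == "refine" else 1.0
    experts.update(action, advice, q, loss)

probs = experts.probabilities()
assert probs[1] >= 0.5
```

Only experts that recommended the executed action are penalized, which keeps the estimate unbiased without evaluating plans that were never run.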

SYSTEM ARCHITECTURE

End-to-End System Architecture

End-to-end architecture of LLM-QP. A constrained decoding query is transformed into a constraint graph, analyzed by the planner, routed through the execution operator space, and refined via telemetry feedback.

The pipeline operates as follows:

Query embedding produces a representation of the decoding state.
The constraint graph defines valid token transitions.
The planner selects an execution plan using contextual signals.
The chosen operator executes the step.
Telemetry feeds back to update routing decisions.

BENCHMARKS

Minimal Reproducible Benchmarks

We provide a portable benchmark harness (synthetic, deterministic) that instantiates the cost models and routing policies used throughout the paper. The suite outputs: (i) dense vs sparse crossover vs $K$, (ii) amortized vs full recomputation crossover vs decode length, and (iii) router regret relative to an oracle policy (threshold router).

The harness now runs here, in the browser. Drag the parameters; the three figures below are recomputed live from the model’s own equations, and the router is a real LinUCB instance learning against the oracle.

Interactive · Claim 1

Sparse scoring wins while K < |V|

Constrained decoding scores only the valid tokens. The dense head costs 2d|V| no matter what; the sparse head costs 2dK, where K is the number of legal tokens right now. Move the model dimension and vocabulary and watch the crossover sit exactly at K = |V|.

Figure (interactive): dense head (2d|V|) vs sparse head (2dK) cost as a function of K. Default settings: model dim d = 4,096, vocabulary |V| = 128,256. Readout: dense step 0.628 ms; sparse step 0.051 ms at K = 256 (12x speedup); crossover at K = 128k.

Interactive · Claim 2

Amortize when the model is confident

If the top token is far ahead of the runner-up (a large margin), the next step probably has the same answer, so skip the full recompute. Refine only when the margin falls below the threshold τ. Drag τ: at τ = 0 the model trusts every step (cheapest, riskiest); at τ = 1 it refines always and collapses onto full recompute.

Figure (interactive): full recompute every step vs amortized (refine when unsure). Readout at refine threshold τ = 0.35: refine rate 35%; 2.4x speedup vs full; 94.5 ms saved over 256 steps.

Interactive · Claim 3 · the live model

The router learns to route

A real LinUCB contextual bandit (Appendix A, running in your browser) sees each step’s context — how many tokens are valid, how confident the model is — and picks a plan. It is paid the true cost and updates. Watch cumulative regret against the hindsight oracle bend over and flatten: that bend is the convergence the paper claims, and nobody told the router the answer.

Figure (interactive): cumulative regret over 1,200 steps (lower is better; flat = converged).

Average cost per step (oracle = 1.00):

Oracle (hindsight): 1.000
LinUCB router: 1.046
Static sparse: 1.060
Static dense: 6.660

Sparse chosen: 92%. Final regret: 2.6 ms.

Deterministic given the seed. The router converges to within a few percent of the unbeatable oracle and leaves both fixed policies behind — and it learns that the conservative dense head is dominated, so it almost never picks it.
