
LLM-QP: Query Planning for Large Language Model Inference

Jake Lawrence  ·  Independent Researcher

SECTION 1

Plan Equivalence

Two inference plans P_1, P_2 are equivalent if they produce identical token sequences under identical model and constraint semantics.

Definition (\epsilon-Equivalence)

Two plans are \epsilon-equivalent if their logits differ by at most \epsilon over the valid candidate set and they produce identical decoded outputs.

Lemma: Dense and Sparse Equivalence

Let \mathcal{A}(s) denote the valid token set. Dense masked scoring:

\tilde{\ell}(v) = \begin{cases} \sigma(h,v) & v \in \mathcal{A}(s) \\ -\infty & \text{otherwise} \end{cases}

Sparse scoring:

\ell(v) = \sigma(h,v), \quad v \in \mathcal{A}(s)

Then:

\arg\max_{v \in V} \tilde{\ell}(v) = \arg\max_{v \in \mathcal{A}(s)} \ell(v)
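
The lemma is easy to check numerically; the sketch below uses toy sizes and random scores (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 32, 8                       # toy vocabulary and hidden sizes
h = rng.normal(size=d)
W = rng.normal(size=(V, d))
logits = W @ h                     # sigma(h, v) for every v in V

valid = np.array([3, 7, 11, 19])   # A(s): valid token ids

# Dense masked scoring: -inf outside A(s)
masked = np.full(V, -np.inf)
masked[valid] = logits[valid]

# Sparse scoring only ever touches the valid set
dense_pick = int(np.argmax(masked))
sparse_pick = int(valid[np.argmax(logits[valid])])
assert dense_pick == sparse_pick   # identical greedy decision
```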
SECTION 2

Roofline Regime Analysis

Dense head compute:

W_{dense} \approx 2d|V|

Sparse head compute:

W_{sparse} \approx 2dK

Sparse scoring does strictly less head work whenever:

K < |V|

though the practical crossover also depends on the refinement probability (see the cost model below).
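
A back-of-envelope cost model makes the crossover concrete. The refinement term below (a probability p_refine of falling back to a full dense rescore) is an assumed extension of the two formulas, not part of them:

```python
d, V = 4096, 50000                      # illustrative head sizes

def w_dense():
    return 2 * d * V                    # W_dense ~ 2d|V|

def w_sparse(K, p_refine=0.0):
    # Assumed model: with probability p_refine the sparse path falls
    # back to a dense rescore, adding 2d|V| of expected work.
    return 2 * d * K + p_refine * 2 * d * V

assert w_sparse(K=64) < w_dense()                # K << |V|: sparse wins
assert w_sparse(K=64, p_refine=1.0) > w_dense()  # always refining loses
```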
SECTION 3

Amortized Query Hypothesis

We model decoder state as:

h_t = h^{stable} + \Delta h_t

Full transformer recomputation may provide limited incremental information when margins are large. Define the margin between the top two candidates:

m_t = \ell(v^{(1)}) - \ell(v^{(2)})

Routing decision — refine if:

m_t < \tau

Otherwise use amortized scoring.
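
The routing decision reduces to a one-line comparison; the threshold value below is a hypothetical default:

```python
def route(top1_logit, top2_logit, tau=2.0):
    """Margin-based routing: refine only when the decision is uncertain."""
    m = top1_logit - top2_logit          # margin m_t
    return "refine" if m < tau else "amortized"

assert route(9.0, 8.5) == "refine"       # small margin: pay for recompute
assert route(9.0, 3.0) == "amortized"    # large margin: cheap path suffices
```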

SECTION 4

Planner Routing as a Contextual Bandit

At decoding step t, the planner observes a context vector x_t \in \mathcal{X} and selects an execution plan a_t \in \mathcal{A}. We define a scalar loss:

\ell_t(a) = \lambda \cdot \mathrm{Latency}_t(a) + (1-\lambda) \cdot \mathrm{QualityLoss}_t(a)

and seek to minimize \sum_{t=1}^T \ell_t(a_t).

Oracle and Regret

Define the per-step oracle action:

a_t^* = \arg\min_{a \in \mathcal{A}} \ell_t(a)

and dynamic regret:

R_T^{dyn} = \sum_{t=1}^T \ell_t(a_t) - \sum_{t=1}^T \ell_t(a_t^*)

For stationary regimes, define the best fixed plan:

a^{stat} = \arg\min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t(a)

with static regret:

R_T^{stat} = \sum_{t=1}^T \ell_t(a_t) - \min_{a \in \mathcal{A}} \sum_{t=1}^T \ell_t(a)
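
Both regret notions can be computed directly from a per-step loss matrix; the sizes and the random routing policy below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T, A = 200, 3
losses = rng.uniform(size=(T, A))       # synthetic per-step losses l_t(a)
actions = rng.integers(A, size=T)       # some routing policy's choices

incurred = losses[np.arange(T), actions].sum()
dyn_regret = incurred - losses.min(axis=1).sum()   # per-step oracle
stat_regret = incurred - losses.sum(axis=0).min()  # best fixed plan

# The per-step oracle is at least as strong as any fixed plan,
# so dynamic regret always dominates static regret.
assert dyn_regret >= stat_regret
assert dyn_regret >= 0
```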

Linear Contextual Bandit

Assume realizability: \mathbb{E}[\ell_t(a) \mid x_t] = x_t^\top \theta_a, with \|x_t\| \le 1. LinUCB or linear Thompson sampling achieves sublinear regret of the form:

R_T^{stat} = \tilde{O}\!\left(d\sqrt{T|\mathcal{A}|}\right)

where d = \dim(x_t) and \tilde{O} hides logarithmic factors.

Nonstationarity and Variation Budget

To capture policy churn, define a variation budget:

V_T = \sum_{t=2}^T \sup_{a \in \mathcal{A}} \left| \mathbb{E}[\ell_t(a) \mid x_t] - \mathbb{E}[\ell_{t-1}(a) \mid x_{t-1}] \right|

Sliding-window or restart-based bandits can achieve dynamic regret bounds of the form:

R_T^{dyn} = \tilde{O}\!\left(\sqrt{T|\mathcal{A}|} + V_T^{1/3} T^{2/3}\right)

under standard boundedness assumptions.

Connection to Margin-Based Routing

The threshold router \pi_\tau from the amortization section is a finite policy class when \tau is discretized. This enables expert-style learning over \{\pi_{\tau_i}\}, with regret O(\sqrt{T \log |\Pi|}) for |\Pi| candidate thresholds.
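
A sketch of the discretized policy class, with an assumed margin range of [0, 4]:

```python
import numpy as np

taus = np.linspace(0.0, 4.0, 9)        # |Pi| = 9 candidate thresholds
# Each policy is a deterministic expert over the margin m_t.
policies = [lambda m, t=t: "refine" if m < t else "amortized" for t in taus]

assert policies[0](1.0) == "amortized"  # tau = 0: never refine
assert policies[-1](1.0) == "refine"    # tau = 4: refine at small margins
```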

Figure: Bandit loop for LLM-QP. The planner maps runtime context to a plan, observes latency and quality proxies (optionally via shadow evaluation), and updates the routing policy online.

SECTION 5

Compiler Integration: Plan Selection in MLIR / StableHLO

Modern inference stacks already compile model graphs through MLIR and StableHLO. LLM-QP can therefore be implemented as a compiler pass that introduces multiple equivalent execution plans and selects among them using a cost model.

Logical vs Physical Plans

We define a logical operator, DecodeStep(query, constraint_state), which expresses the semantics of a single constrained decoding step. Physical implementations include:

  • Dense projection head
  • Sparse adjacency scoring
  • Amortized query update
  • Amortized update with rerank
  • Full recomputation (refinement)

Plan Expansion Pass

During compilation the planner expands the logical node into candidate physical plans:

DecodeStep
  -> DenseHead
  -> SparseHead
  -> AmortizedHead
  -> AmortizedHead + Rerank

Cost Model

Each plan is annotated with an estimated cost covering kernel latency estimates, memory bandwidth usage, refinement probability, and device utilization. The cost function follows:

C = \lambda \cdot \mathrm{Latency} + (1-\lambda) \cdot \mathrm{QualityLoss}

Rewrite Rule

if cost(SparseHead) < cost(DenseHead):
    use SparseHead
else:
    use DenseHead

In practice this is implemented using MLIR pattern rewrites or XLA custom calls.
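
A minimal sketch of the cost-model comparison driving the rewrite; the per-plan latency and quality numbers below are invented placeholders:

```python
LAMBDA = 0.7                            # latency/quality tradeoff weight

def plan_cost(latency_ms, quality_loss, lam=LAMBDA):
    """C = lambda * Latency + (1 - lambda) * QualityLoss."""
    return lam * latency_ms + (1 - lam) * quality_loss

# Hypothetical per-plan estimates: (latency_ms, quality_loss)
plans = {
    "DenseHead":            (1.9, 0.00),
    "SparseHead":           (0.4, 0.05),
    "AmortizedHead":        (0.1, 0.20),
    "AmortizedHead+Rerank": (0.3, 0.08),
}
best = min(plans, key=lambda name: plan_cost(*plans[name]))
```

The rewrite rule above is just this `min` over the expanded candidate set.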

Runtime Adaptation

Runtime telemetry (margins, branching factor, cache state) feeds back into the cost model. The compiler therefore emits a small routing kernel that chooses the implementation at runtime. This hybrid compile-time / runtime planning architecture allows LLM-QP to reuse existing compiler infrastructure while enabling adaptive execution.

FIGURES

Visualizations

Figure: Amortized vs full recompute routing tradeoff.

Figure: Routing cost vs margin threshold.

Figure: Synthetic margin vs branching factor.

APPENDIX

Planner Pseudocode

LinUCB for Plan Selection

Let A = |\mathcal{A}| and let d be the feature dimension. Maintain per-action matrices V_a \in \mathbb{R}^{d \times d} and vectors b_a \in \mathbb{R}^d, initialized as V_a = \alpha I and b_a = 0. At each step t:

  1. Observe context x_t.
  2. For each action a, estimate \hat{\theta}_a = V_a^{-1} b_a and compute the lower confidence bound:
    \mathrm{lcb}_a = x_t^\top \hat{\theta}_a - \beta \sqrt{x_t^\top V_a^{-1} x_t}
  3. Choose a_t = \arg\min_a \mathrm{lcb}_a (optimism in the face of uncertainty, stated for losses: the exploration bonus is subtracted).
  4. Execute plan a_t and observe loss \ell_t(a_t).
  5. Update V_{a_t} \leftarrow V_{a_t} + x_t x_t^\top and b_{a_t} \leftarrow b_{a_t} + x_t \ell_t(a_t).
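
A compact NumPy sketch of this loop. Since the objective is a loss, the exploration bonus is subtracted (a lower confidence bound) so that argmin remains optimistic; the alpha and beta defaults are illustrative:

```python
import numpy as np

class LinUCBPlanner:
    """Linear bandit over plans; optimistic (for losses) plan selection."""

    def __init__(self, n_actions, d, alpha=1.0, beta=0.5):
        self.V = [alpha * np.eye(d) for _ in range(n_actions)]  # V_a = alpha*I
        self.b = [np.zeros(d) for _ in range(n_actions)]        # b_a = 0
        self.beta = beta

    def select(self, x):
        scores = []
        for V_a, b_a in zip(self.V, self.b):
            V_inv = np.linalg.inv(V_a)
            theta_hat = V_inv @ b_a                    # ridge estimate
            bonus = self.beta * np.sqrt(x @ V_inv @ x)
            scores.append(x @ theta_hat - bonus)       # lower confidence bound
        return int(np.argmin(scores))                  # optimistic for losses

    def update(self, a, x, loss):
        self.V[a] += np.outer(x, x)                    # V_a += x x^T
        self.b[a] += loss * x                          # b_a += loss * x
```

A short training run on synthetic linear losses concentrates quickly on the lower-loss plan.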

Expert-Over-Thresholds Routing

Discretize thresholds \{\tau_1, \dots, \tau_N\} and define policies \pi_i = \pi_{\tau_i}. Maintain weights w_i, initialized uniformly. At each step t:

  1. Observe context x_t and compute each expert's suggested action a_t^{(i)} = \pi_i(x_t).
  2. Sample an expert index i proportionally to the weights (or pick \arg\max_i w_i for greedy selection).
  3. Execute a_t = a_t^{(i)} and observe loss \ell_t(a_t).
  4. Update the weights by exponential weighting:
    w_i \leftarrow w_i \exp(-\eta \hat{\ell}_{t,i})
    where \hat{\ell}_{t,i} is an unbiased estimate of expert i's loss.

This realizes regret O(\sqrt{T \log N}) against the best threshold policy in hindsight under standard assumptions.
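
A full-information variant (Hedge) is the simplest instantiation; unlike the unbiased bandit estimate above, it assumes every expert's loss is observable each step, e.g. via shadow evaluation:

```python
import numpy as np

def hedge(loss_rows, n_experts, eta=0.5):
    """Exponential weighting with full-information (shadow) feedback."""
    w = np.ones(n_experts)
    incurred = 0.0
    for losses in loss_rows:            # losses[i]: expert i's loss this step
        i = int(np.argmax(w))           # greedy pick (could also sample)
        incurred += losses[i]
        w = w * np.exp(-eta * np.asarray(losses))
    return incurred

# Expert 1 is best; Hedge locks on after one mistaken step.
rows = [[0.9, 0.1]] * 100
assert hedge(rows, 2) <= 0.9 + 0.1 * 99 + 1e-9
```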

SYSTEM ARCHITECTURE

End-to-End System Architecture

Figure: End-to-end architecture of LLM-QP. A constrained decoding query is transformed into a constraint graph, analyzed by the planner, routed through the execution operator space, and refined via telemetry feedback.

The pipeline operates as follows:

  1. Query embedding produces a representation of the decoding state.
  2. The constraint graph defines valid token transitions.
  3. The planner selects an execution plan using contextual signals.
  4. The chosen operator executes the step.
  5. Telemetry feeds back to update routing decisions.
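
The five stages compose into a single decode loop; every component name below is a hypothetical stand-in:

```python
def decode_loop(embed, allowed_fn, plan_fn, operators, feedback, steps=8):
    state = embed()                          # 1. query embedding
    tokens = []
    for _ in range(steps):
        allowed = allowed_fn(state)          # 2. constraint graph
        a = plan_fn(state, allowed)          # 3. planner picks a plan
        token, telemetry = operators[a](state, allowed)  # 4. execute
        feedback(a, telemetry)               # 5. telemetry feedback
        tokens.append(token)
        state = token
    return tokens

# Trivial stand-ins: one operator, constraints always allow {0, 1}.
ops = {"dense": lambda s, allowed: (min(allowed), {"latency_ms": 1.0})}
out = decode_loop(lambda: 0, lambda s: {0, 1},
                  lambda s, allowed: "dense", ops, lambda a, t: None)
```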
BENCHMARKS

Minimal Reproducible Benchmarks

We provide a portable benchmark harness (synthetic, deterministic) that instantiates the cost models and routing policies used throughout the paper. The suite outputs: (i) dense vs sparse crossover vs K, (ii) amortized vs full recomputation crossover vs decode length, and (iii) router regret relative to an oracle policy (threshold router).

Figure: Dense vs sparse crossover (synthetic roofline + refinement).

Figure: Cumulative amortization crossover over sequence length.

Figure: Cumulative router regret vs oracle (threshold policy).

2026 Jake Lawrence