Local AI Infrastructure Notes (8/15) — Local LLM Acceleration and Quantization Deep Dive

Same model, same hardware — one setup squeezes 200 tok/s, another tops out at 30. The difference is the algorithm.

In early 2026, three techniques redrew the limits of LLM inference speed: DFlash (Z Lab, Feb 2026; block-diffusion speculative decoding), DDTree (Ringel & Romano, Apr 2026; tree-attention verification), and TurboQuant (Google Research, ICLR 2026; 3-bit KV cache quantization that its authors describe as lossless). They don't interfere with each other, so applied together the speedups compound.

This article walks through all three from first-party sources (arXiv, GitHub, official announcements), starting from prefill/decode/KV cache fundamentals through real-world deployment status. Developer-targeted, but it answers questions like "Why is the M4 Pro nearly 2× faster at inference than the base M4?"

Key Takeaways

  • Prefill = compute-bound, Decode = memory-bandwidth-bound. Different games, different optimizations.
  • DFlash: block diffusion drafts K tokens at once → 6× speedup, 2.5× over EAGLE-3.
  • DDTree: builds a draft tree from diffusion distributions → MATH 7.5×, HumanEval 8.2×.
  • TurboQuant: random rotation + QJL → 3.5-bit lossless KV cache. 8× on H100.
  • All three are orthogonal. Theoretical compound 8~10×.

1. Fundamentals — Prefill and Decode Are Different Games

LLM inference has two phases. Many articles treat them as one. They aren't.

Prefill — Processing the Input

The full prompt is processed in parallel. K and V values for every input token are computed at once, building the KV cache. Massive matrix multiplications max out GPU tensor cores (or Apple M5 Neural Accelerators) — compute-bound.

H100 with a 10K-token prompt: prefill takes 200~400ms. M5 with a 14B dense model: under 10s. 30B MoE: under 3s.

The metric for this phase is TTFT (Time To First Token).

Decode — Generating the Output

Tokens are produced one at a time, autoregressively. Every step has to read the full KV cache and the full model weights from memory. Computation is small, but memory access becomes the bottleneck — memory-bandwidth-bound.

The metric here is TPOT (Time Per Output Token), or its inverse, tok/s. Typically 30~150 tok/s.
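Both metrics fall out of token arrival timestamps. A minimal sketch, assuming you can record when the request was sent and when each token arrived; the function name and the trace below are hypothetical, not taken from any serving framework:

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT and decode tok/s from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one generated token")
    # TTFT covers prefill: request sent -> first token out.
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return {"ttft_s": ttft, "tpot_s": None, "tok_per_s": None}
    # TPOT covers decode only: average gap between consecutive tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "tok_per_s": 1.0 / tpot}


# Hypothetical trace: first token after 0.5 s, then one token every 20 ms.
metrics = latency_metrics(0.0, [0.50, 0.52, 0.54, 0.56])
print(metrics)
```

Keeping the two numbers separate matters: a long prompt inflates TTFT without changing TPOT, and a slow decode does the opposite.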

The Role of the KV Cache

Without it, each decode step at position t must recompute K and V for all t tokens seen so far, so generating n tokens costs O(n²) K/V projections. With the cache, each step computes K and V for the new token only and appends them: O(n) projections overall. The KV cache is therefore mandatory for any serious inference.
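A toy illustration of that difference. This is a sketch in pure Python, assuming a single attention head, element-wise "projections" standing in for the real weight matrices, and the raw token vector reused as the query; none of it is a real model, it only counts work:

```python
import math


def project(x, w):
    """Toy stand-in for a K or V projection (element-wise scaling)."""
    return [xi * wi for xi, wi in zip(x, w)]


def attend(query, keys, values):
    """Single-head softmax attention over everything seen so far."""
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(len(query)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def decode_without_cache(tokens, wk, wv):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        # Recompute K and V for the whole prefix at every step.
        keys = [project(x, wk) for x in tokens[:t]]
        values = [project(x, wv) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections


def decode_with_cache(tokens, wk, wv):
    outputs, projections = [], 0
    keys, values = [], []  # this pair of lists is the KV cache
    for x in tokens:
        keys.append(project(x, wk))  # only the new token is projected
        values.append(project(x, wv))
        projections += 2
        outputs.append(attend(x, keys, values))
    return outputs, projections


tokens = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9]]
_, slow = decode_without_cache(tokens, [0.5, 2.0], [1.5, -1.0])
_, fast = decode_with_cache(tokens, [0.5, 2.0], [1.5, -1.0])
print(slow, fast)  # projection counts for 4 tokens
```

Both functions return the same attention outputs; only the projection count differs, and the gap widens quadratically with sequence length.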

The problem is size. A 70B model with 200K context = 40~80 GB KV cache — sometimes larger than the model weights themselves.
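The arithmetic behind that range, as a rough sketch. The layer count, KV head count, and head dimension below are an assumed shape for a 70B-class model with grouped-query attention; they are not taken from this article's sources:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value=16):
    """Bytes needed to hold K and V for one sequence."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = K and V
    return values_per_token * seq_len * bits_per_value / 8


# Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=200_000)
print(f"{size / 1e9:.1f} GB")  # 65.5 GB
```

Lowering bits_per_value scales the result linearly, which is why KV cache quantization pays off most at long context.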

Memory Bandwidth Decides — Apple Silicon Comparison

Chip | Bandwidth
M3 base | 100 GB/s
M4 base | 120 GB/s
M5 base | 153 GB/s
M3 Pro | 150 GB/s
M4 Pro | 273 GB/s (2.3× over M4 base)
M3 Max | 400 GB/s
M4 Max | 546 GB/s
M5 Max | ~600 GB/s

The M4 Pro is nearly 2× faster at inference than the base M4 because its memory bandwidth is 2.3× higher. Decode is memory-bound, so this is the expected outcome.

If you can read the same KV cache faster, you produce more tokens per second. That's the entire game.
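A back-of-the-envelope version of that claim. The sketch assumes a dense model whose weights and KV cache are each streamed from memory once per decoded token, and it ignores compute and cache effects entirely; the 8 GB model size is a hypothetical figure picked for illustration:

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb, kv_cache_gb=0.0):
    """Upper bound on decode tok/s when each step reads weights + KV cache once."""
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)


MODEL_GB = 8.0  # hypothetical quantized model size, for illustration only

for chip, bandwidth in [("M4 base", 120), ("M4 Pro", 273)]:
    print(chip, round(decode_ceiling_tok_s(bandwidth, MODEL_GB), 1), "tok/s ceiling")
```

The ceilings differ by exactly the bandwidth ratio (273 / 120 ≈ 2.3). Measured speedups land below that because real decode is not purely bandwidth-limited.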


2. DFlash — Block-Diffusion Speculative Decoding

Speculative Decoding First

Standard decode produces one token per forward pass of the full model. Speculative decoding works in three steps:

  1. A smaller draft model guesses K tokens
  2. The larger target model verifies all K in one pass
  3. The matching prefix is accepted; decoding restarts from the first mismatch

One forward pass of the big model produces K tokens — theoretical ceiling K×.
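The accept step can be sketched for the greedy case. This assumes the target model returns its own greedy choice at each drafted position from a single verification pass; the extra "bonus" token on full acceptance is a common refinement and an assumption here, as is the function name:

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:  K tokens proposed by the draft model.
    target_tokens: K + 1 tokens; target_tokens[i] is the target model's greedy
                   choice given the prefix plus draft_tokens[:i].
    Returns the tokens emitted by this single target pass.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one more")
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            emitted.append(wanted)  # target's token replaces the first mismatch
            return emitted
        emitted.append(drafted)
    emitted.append(target_tokens[-1])  # everything matched: take the bonus token
    return emitted


print(accept_draft([5, 7, 9, 2], [5, 7, 1, 4, 8]))  # [5, 7, 1]
print(accept_draft([5, 7], [5, 7, 3]))              # [5, 7, 3]
```

Because every emitted token is one the target model would have chosen anyway, the output matches plain greedy decoding. Only the number of target passes changes.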

EAGLE-3's Bottleneck

EAGLE-3 uses an autoregressive draft model, so drafting K tokens takes K sequential draft passes. The draft step itself becomes the bottleneck.

DFlash's Solution

Replace the autoregressive draft with a block diffusion drafter that generates K tokens in one parallel forward pass.
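Why one parallel pass matters can be seen in a simple cost model. This is a sketch, not the paper's analysis: costs are in units of one target forward pass, verification of a block is assumed to cost exactly one such pass, and every number below is hypothetical:

```python
def speculative_speedup(accepted_per_cycle, draft_len, draft_cost, parallel_draft):
    """Speedup over plain decoding under a simplified cost model.

    An autoregressive drafter pays draft_cost for each of the draft_len tokens;
    a parallel (block) drafter pays it once per block.
    """
    drafting = draft_cost if parallel_draft else draft_cost * draft_len
    return accepted_per_cycle / (1.0 + drafting)


# Hypothetical numbers: 8-token block, 6 tokens emitted per cycle,
# one draft pass costing 10% of a target pass.
print(speculative_speedup(6, 8, 0.10, parallel_draft=False))  # 6 / 1.8
print(speculative_speedup(6, 8, 0.10, parallel_draft=True))   # 6 / 1.1
```

With the same acceptance, the autoregressive drafter's overhead grows with block length while the parallel drafter's stays flat, so longer blocks become affordable.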

Technical specifics:

  • Block diffusion drafter: K=4~8 tokens denoised in one pass
  • Feature extraction: pulls context features from deep layers of the target model
  • KV injection: target features injected into every layer of the draft (EAGLE-3 only injects into a single layer)

Verified Performance

  • Qwen3-8B (greedy): 6× lossless speedup
  • vs EAGLE-3: 2.5× faster
  • Qwen3.6-27B-DFlash on RTX 3090: 207 tok/s @ 256K context
  • Consistent across GSM8K, MATH-500, AIME24/25, HumanEval, chat
  • Speedup holds under temperature sampling and reasoning modes

Availability

  • Paper: arXiv:2602.06036 (Z Lab — Jian Chen, Yesheng Liang, Zhijian Liu, Feb 2026)
  • Code: github.com/z-lab/dflash
  • Models: Hugging Face z-lab/dflash collection (Qwen3-4B/8B, Qwen3-Coder-30B-A3B, LLaMA-3.1-8B variants)
  • Integrations: SGLang (production), Transformers (experimental), vLLM (in progress), llama.cpp (community discussion)

3. DDTree — Extending Diffusion Drafts Into a Tree

The Information DFlash Discards

DFlash produces a single token sequence quickly. But the draft model knows distributions over multiple possibilities at each position — DFlash throws that away.

DDTree's Idea

Use the draft model's per-position probability distributions to construct a tree, and have the target model verify the whole tree in one forward pass via tree attention.

The 4-Step Process

  1. Draft model generates per-position distributions for the next block
  2. Construct a tree within a fixed node budget (high-probability branches first)
  3. Target model verifies the entire tree via tree attention in a single forward pass
  4. Walk down matching branches; restart from the first mismatch
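Steps 2 and 3 can be sketched as follows. This is one plausible reading, not the DDTree authors' implementation: it assumes the drafter gives an independent distribution per position, scores a path by the product of its token probabilities, and expands best-first until the node budget is spent. All function names are hypothetical:

```python
import heapq


def build_draft_tree(position_probs, node_budget):
    """Best-first tree construction.

    position_probs: one dict per draft position, mapping token -> probability.
    Returns tree nodes as token paths (tuples), highest-probability first.
    """
    heap = [(-p, (tok,)) for tok, p in position_probs[0].items()]
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < node_budget:
        neg_prob, path = heapq.heappop(heap)
        tree.append(path)
        depth = len(path)
        if depth < len(position_probs):
            # A child's score is the parent's path probability times its own.
            for tok, p in position_probs[depth].items():
                heapq.heappush(heap, (neg_prob * p, path + (tok,)))
    return tree


def tree_attention_mask(tree):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[node[:len(other)] == other for other in tree] for node in tree]


probs = [{1: 0.6, 2: 0.4}, {3: 0.9, 4: 0.1}]
tree = build_draft_tree(probs, node_budget=4)
print(tree)  # [(1,), (1, 3), (2,), (2, 3)]
for row in tree_attention_mask(tree):
    print(row)
```

A child is only pushed after its parent is popped, so the tree is always prefix-closed. The mask lets the target model score every branch in one pass while each node sees only its own ancestors.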

Verified Performance

Benchmark | Speedup
MATH-500 | 7.5×
HumanEval (T=0.0) | 8.2×
GSM8K | 6.6×
AIME'24 | 7.3×
AIME'25 | 7.2×
MBPP | 6.4×
LiveCodeBench | 6.8×
SWE-bench Lite | 4.3×
MT-Bench | 4.2×
Alpaca | 3.3×

→ Effects are largest on deterministic tasks like math and code (branches converge fast).

Tradeoff

Larger node budgets increase acceptance length, but verification cost dominates past a point — speedup plateaus. Per-task calibration required.

Availability

  • Paper: arXiv:2604.12989 (Ringel & Romano, Apr 2026)
  • Code: github.com/liranringel/ddtree

4. TurboQuant — KV Cache Down to 3 Bits

Why KV Cache Quantization Matters

70B model + 200K context = 40~80 GB KV cache. Compressing model weights alone isn't enough — the KV cache is what eats memory bandwidth and capacity. The cache itself has to be compressed.

Limits of Existing Quantization

  • INT8/FP8: nearly lossless, but capped at 2× compression
  • INT4/FP4: 4× compression at 1~3% accuracy cost
  • GPTQ-style vector quantization: requires calibration data

TurboQuant's Approach

Apply a random rotation to each input vector so that every coordinate follows nearly the same (Beta) distribution, then apply one standard scalar quantizer to all coordinates.
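A small sketch of the rotate-then-quantize idea. Two substitutions are made for brevity and should be read as assumptions: a randomized Walsh-Hadamard transform stands in for a dense random rotation, and a plain uniform quantizer stands in for the paper's MSE-optimal codebook. The QJL stage is left out:

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def make_signs(n, seed=0):
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(vec, signs):
    """Random sign flips followed by Hadamard: an orthogonal 'rotation'."""
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(vec, signs):
    """Inverse of rotate: the orthonormal Hadamard matrix is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def quantize(vec, bits):
    """Uniform scalar quantizer shared by all coordinates."""
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0
    return [round((x - lo) / step) for x in vec], lo, step


def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]


# One large outlier dominates the raw vector; rotation spreads its energy.
key = [8.0, 0.3, -0.2, 0.5, -0.4, 0.1, 0.25, -0.35]
signs = make_signs(len(key), seed=1)
rotated = rotate(key, signs)
codes, lo, step = quantize(rotated, bits=3)
restored = unrotate(dequantize(codes, lo, step), signs)
print([round(x, 2) for x in rotated])
print([round(x, 2) for x in restored])
```

The rotation is orthogonal, so quantization error measured after rotation equals the error in the original space. The random signs are the only state needed besides the codes, which is what makes the scheme data-oblivious.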

Two-Stage Algorithm

  1. Rotation + scalar quantization: input → random rotation → MSE-optimal scalar quantizer per coordinate
  2. QJL error correction: 1 extra bit per coordinate applies a Quantized Johnson-Lindenstrauss (QJL) transform to the residual, which corrects the remaining error and removes bias from inner-product estimates
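The QJL building block in isolation, as a sketch. It shows only the 1-bit sign sketch and its inner-product estimator; in the two-stage scheme it would be applied to the residual left by stage 1. The projection count m and all helper names are assumptions made for the demo:

```python
import math
import random


def gaussian_matrix(rows, cols, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def qjl_encode(key, proj):
    """Keep one sign bit per random projection, plus the key's norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in proj]
    return signs, math.sqrt(dot(key, key))


def qjl_inner_product(query, signs, key_norm, proj):
    """Unbiased estimate of <query, key> from the 1-bit sketch.

    For Gaussian g, E[(g.q) * sign(g.k)] = sqrt(2/pi) * <q, k> / |k|,
    so rescaling by sqrt(pi/2) * |k| removes the bias.
    """
    acc = sum(dot(row, query) * s for row, s in zip(proj, signs))
    return math.sqrt(math.pi / 2.0) * key_norm * acc / len(proj)


key = [0.9, -0.4, 0.3, 0.7, -0.2, 0.5, -0.6, 0.1]
query = [0.5, 0.2, -0.3, 0.8, 0.1, -0.4, 0.6, 0.9]
proj = gaussian_matrix(4000, len(key), seed=0)  # large m only to make the demo stable
signs, key_norm = qjl_encode(key, proj)
print("exact   :", dot(query, key))
print("estimate:", qjl_inner_product(query, signs, key_norm, proj))
```

The query stays in full precision and only the key is reduced to signs, which fits attention: keys are stored once in the cache, queries are computed fresh each step.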

Verified Performance (Paper)

  • 3.5 bits/channel: absolute quality neutrality (no degradation)
  • 2.5 bits/channel: marginal quality drop only
  • 3-bit KV cache: training-free, data-oblivious, lossless
  • 4-bit on H100: up to 8× speedup over 32-bit unquantized keys
  • Within 2.7× of the information-theoretic lower bound

TurboQuant+ — Apple Silicon Variant

Base TurboQuant + PolarQuant + Walsh-Hadamard rotation. Optimized for Apple Silicon.

  • 3.8~6.4× compression
  • M5 Max 128K context: prefill on par with q8_0, decode at 0.9× (effectively equivalent)

No Training Required

This is the real differentiator.

"TurboQuant can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy."

Applied at inference time. No retraining.

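Availability

  • Paper: arXiv:2504.19874 (Google Research), plus the Google Research blog announcement
  • Community implementations: tonbistudio/turboquant-pytorch, vLLM fork by 0xSero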

5. The Three Are Orthogonal — Stack Them

They Attack Different Things

Technique | Target | Layer
DFlash / DDTree | Tokens per decode pass (1 → K) | Inference algorithm
TurboQuant | KV cache memory (FP16 → 3-bit) | Memory representation
MLX UMA | CPU↔GPU copy elimination | Hardware acceleration

They don't interfere. They can stack.

Theoretical Compounding (Approximation)

Baseline: Qwen3.6-27B on M4 Max 64GB, FP16, normal decode.

Layer | tok/s (estimate)
Baseline (llama.cpp) | ~30
+ MLX | ~50 (1.6×)
+ TurboQuant 3-bit | ~70 (1.4× more, KV memory pressure relieved)
+ DFlash speculative | ~150 (2.1× more)
+ DDTree (code/math only) | ~250 (1.7× more)

→ Theoretical compound 8~10×. Real measurements depend on task, model, hardware. The table is theoretical, not measured.
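The compounding is plain multiplication of the per-layer factors. A quick check using the table's own estimates (all of them theoretical, as noted above):

```python
def compound(baseline_tok_s, stages):
    """Apply each multiplicative factor in order and record the running tok/s."""
    rows, current = [], baseline_tok_s
    for name, factor in stages:
        current *= factor
        rows.append((name, current))
    return rows


stages = [("MLX", 1.6), ("TurboQuant 3-bit", 1.4), ("DFlash", 2.1), ("DDTree", 1.7)]
rows = compound(30.0, stages)
for name, tok_s in rows:
    print(f"+ {name}: {tok_s:.0f} tok/s")
print(f"total: {rows[-1][1] / 30.0:.1f}x")  # total: 8.0x
```

The exact product is about 240 tok/s, or 8.0×; the table rounds each row up, and the top of the 8~10× range assumes slightly better factors than those listed.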

By Use Case

  • Chat / summarization (short streaming): DFlash alone is enough
  • Coding / math (branchy outputs): DDTree
  • Long context (RAG, 100K+): TurboQuant required
  • Agents (repeated system prompts): oMLX SSD cache + TurboQuant

6. Tool Integration Status (2026-04-29)

Tool | DFlash | DDTree | TurboQuant
SGLang | ✅ Production | In progress | Experimental
vLLM | In progress | In progress | ✅ (0xSero fork)
Transformers | Experimental | - | -
llama.cpp | Community | - | Partial (PolarQuant)
MLX (Apple) | Not yet | Not yet | TurboQuant+ supported
oMLX | Not yet | Not yet | Integration considered
Ollama 0.19 | Not yet | - | NVFP4 (different family)

What to Do Today

  • NVIDIA GPU users (RTX 3090/4090, H100): Try DFlash via SGLang and TurboQuant via the 0xSero vLLM fork today
  • Apple Silicon (M3/M4/M5): Try TurboQuant+ via tonbistudio/turboquant-pytorch. Wait for MLX ports of DFlash.
  • Mac + simple use: Stay with oMLX defaults. Monitor integration progress.

7. Caveats

"8× Faster" Is an Illusion

The 8~10× compounding above is theoretical. In practice:

  • DFlash speedup depends on model, hardware, temperature; typical range 1.5~6×
  • TurboQuant's complete losslessness is at 3.5 bits. 3 bits is near-lossless; some tasks may show minor degradation
  • Tool integration is incomplete, so only staged adoption is possible today

Apple Silicon Reality Check

For oMLX/MLX users, the only stack-ready acceleration today is TurboQuant+. MLX ports of DFlash and DDTree are in progress with no firm dates.

Domain Sensitivity

3-bit quantization preserves accuracy theoretically, but medical/financial domains warrant additional verification. General chat and coding are essentially unaffected.


Bottom Line

The three techniques in one line each:

Technique | One-line | Speedup
DFlash | Block diffusion drafts K tokens at once | 6× (Qwen3-8B, greedy)
DDTree | Diffusion distribution → tree + tree attention | 7.5× (math)
TurboQuant | Random rotation + QJL → 3-bit KV cache | 8× (H100)

The single takeaway: "Prefill is compute, decode is memory. The three techniques attack the memory side orthogonally."

What you can do today depends on hardware. NVIDIA GPU users can stack DFlash + TurboQuant via vLLM/SGLang now. Apple Silicon users should get the same once MLX ports land; the estimate here is 6~8 weeks, though no firm dates exist. Until then, run oMLX defaults and test the stack in a vLLM/SGLang environment.


First-Party Sources

  • DFlash paper: arxiv.org/abs/2602.06036, Z Lab
  • DDTree paper: arxiv.org/abs/2604.12989, GitHub liranringel/ddtree
  • TurboQuant paper: arxiv.org/abs/2504.19874, Google Research blog
  • Apple ML Research M5: machinelearning.apple.com/research/exploring-llms-mlx-m5
  • Prefill/Decode foundations: morphllm.com/llm-inference, morphllm.com/kv-cache-explained
