We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the linear layers and the KV cache of large language and diffusion transformers.
HyperQuant combines four well-studied components into a single construction: (i) a per-tile Walsh-Hadamard rotation that whitens weights and activations toward approximately multivariate-Gaussian statistics, (ii) quantization on a low-dimensional optimal lattice (E8, D4, A2, or Z), (iii) a near-entropy-optimal variable-length Rice coding scheme over lattice indices, and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics.
How HyperQuant works
HyperQuant is built as one encode/decode pipeline that serves two very different workloads: linear weights, which are compressed offline once, and KV cache vectors, which are produced online during generation. The common path first applies a randomized Hadamard transform to make each tile look closer to an isotropic Gaussian, then rescales it to a lattice operating point, quantizes it, maps the lattice code to compact integer symbols, and Rice-codes those symbols into a bitstream. On decode, the process runs in reverse and feeds the low-precision matrix multiply-accumulate (MMA) input. The KV-cache path adds bias-correction hooks, such as rotation and subtractive dither, because attention is especially sensitive to biased inner products.
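The rotate, rescale, and quantize steps of the common path can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the scalar lattice Z only, a fixed step size, and hypothetical helper names (`fwht`, `encode_tile`, `decode_tile`). Rice coding and the KV-cache bias-correction hooks are left out.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def encode_tile(tile, signs, step):
    """Random sign flips + Hadamard rotation, then rounding onto step * Z^n."""
    rotated = fwht([v * s for v, s in zip(tile, signs)])
    return [round(v / step) for v in rotated]

def decode_tile(indices, signs, step):
    """Rescale the indices, undo the rotation, undo the sign flips."""
    rotated = [q * step for q in indices]
    # The orthonormal Hadamard matrix is its own inverse, and signs are +/-1.
    return [v * s for v, s in zip(fwht(rotated), signs)]

rng = random.Random(0)
n = 16                      # tile length (assumed; must be a power of two)
step = 0.25                 # quantizer step (assumed operating point)
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
tile = [rng.gauss(0.0, 1.0) for _ in range(n)]
tile[3] = 8.0               # a single outlier, spread out by the rotation

idx = encode_tile(tile, signs, step)
recon = decode_tile(idx, signs, step)
l2_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(tile, recon)))
```

Because the rotation is orthonormal, the reconstruction error in the original domain has the same Euclidean norm as the rounding error in the rotated domain, so `l2_err` is bounded by `0.5 * step * sqrt(n)` regardless of the outlier.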
Experiments and results
The experiments ask whether the same compression idea holds across three settings: Llama weight compression, online KV-cache compression, and a non-LLM stress test on the LTX-2 video DiT. For the language-model experiments, the paper reports pseudo-quantization perplexity on WikiText-2, with the KV comparison also matching the Qwen2.5-7B-Instruct-1M setup used by OCTOPUS. For video, the paper applies HyperQuant to the LTX-2-19B linear weights at 4 bps and reports image/video quality metrics over a 32-prompt suite, while noting that wall-clock acceleration still depends on a fused kernel implementation.
| Setting | Rate | Criterion | HyperQuant | Reference |
|---|---|---|---|---|
| Weights + KV cache, INT8 MMA | 4 bps | PPL ↓ | 7.50 (+0.47%) | 7.16 (bf16) |
| Weights | 4 bps | ΔPPL% ↓ | +3.8% | +6.4% (HIGGS) |
| Weights | 3 bps | ΔPPL% ↓ | +22.1% | +33% (HIGGS) |
| KV cache | 2 bps | ΔPPL% ↓ | +7.4% | +34.7% (OCTOPUS) |
| KV cache | 2 bps | Compression ↑ | 6.4x | 3.0x (TurboQuant) |
| KV cache | 1.7 bps | ΔPPL% ↓ | +26.9% | - |
| LTX-2 video | 4 bps | LPIPS ↓ | 0.20-0.21 | 0 (bf16) |
Weights: perplexity vs. bitrate
This is the direct weight-only comparison against HIGGS on Llama-3.1-8B-Instruct evaluated on WikiText-2. HIGGS uses fixed-rate finite codebooks; HyperQuant uses a lattice plus variable-length Rice coding, so it can hit the same bps targets while spending fewer bits on likely symbols and allowing rare tail values to cost more. The important read is that every HyperQuant lattice beats HIGGS at the tested rates from 3 to 5 bps, with the advantage most visible in the more aggressive low-bit regime.
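The trade-off between cheap likely symbols and expensive rare ones can be seen in a small Rice coder. The sketch below is a generic Golomb-Rice code over zigzag-mapped signed integers, written with hypothetical function names and a bit string instead of a packed bitstream; the paper's actual symbol mapping and parameter selection are not described here and may differ.

```python
def zigzag(v):
    """Map signed ints to non-negative ints: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * v if v >= 0 else -2 * v - 1

def unzigzag(u):
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -((u + 1) // 2)

def rice_encode(values, k):
    """Encode each value as a unary quotient, a '0' terminator, and k remainder bits."""
    parts = []
    for v in values:
        u = zigzag(v)
        q, r = u >> k, u & ((1 << k) - 1)
        parts.append("1" * q + "0")
        if k > 0:
            parts.append(format(r, "0{}b".format(k)))
    return "".join(parts)

def rice_decode(bits, k, count):
    """Decode `count` values from a bit string produced by rice_encode."""
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the '0' terminator
        r = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        out.append(unzigzag((q << k) | r))
    return out

symbols = [0, 0, 1, -1, 0, 2, 0, -7]   # mostly small, one rare tail value
stream = rice_encode(symbols, k=1)
decoded = rice_decode(stream, k=1, count=len(symbols))
bits_per_symbol = len(stream) / len(symbols)
```

With `k=1`, each `0` costs 2 bits while the tail value `-7` costs 8 bits, and the eight symbols take 25 bits in total. A fixed-rate code that covers the range from -7 to 7 would spend 4 bits on every symbol, or 32 bits here.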
Weights: why the gap appears
The paper explains the HIGGS gap through weight SNR, which is the quantity that best predicts perplexity damage for this kind of post-training weight quantization. HyperQuant gains from two places at once: Rice coding recovers entropy slack that fixed-rate codebooks leave on the table, and the unbounded lattice codebook avoids the tail-cell problem of finite codebooks. Higher-dimensional lattices such as D4 and E8 add another source of gain because their Voronoi cells have better second moments, which is why the SNR curve stays above HIGGS across the range.
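The second-moment argument can be checked numerically. The normalized second moment of a lattice is the per-dimension mean squared quantization error divided by the cell volume raised to the power 2/n; the standard values are 1/12 (about 0.0833) for Z and about 0.0766 for D4. The sketch below estimates both by Monte Carlo using the classic round-then-fix-parity nearest-point rule for D4. The function names are illustrative and this is not the paper's kernel.

```python
import random

def quantize_z4(x):
    """Nearest point of Z^4: round each coordinate independently."""
    return [round(v) for v in x]

def quantize_d4(x):
    """Nearest point of D4 = {z in Z^4 : sum(z) is even}."""
    z = [round(v) for v in x]
    if sum(z) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest
        # rounding error in the opposite direction.
        i = max(range(4), key=lambda j: abs(x[j] - z[j]))
        z[i] += 1 if x[i] > z[i] else -1
    return z

def normalized_second_moment(quantizer, cell_volume, samples=50_000, seed=0):
    """Monte Carlo estimate of (MSE per dimension) / cell_volume ** (2 / n), n = 4."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # Uniform over [0, 8)^4, a whole number of periods of both lattices.
        x = [rng.uniform(0.0, 8.0) for _ in range(4)]
        q = quantizer(x)
        total += sum((a - b) ** 2 for a, b in zip(x, q))
    mse_per_dim = total / (samples * 4)
    return mse_per_dim / cell_volume ** (2 / 4)

g_z = normalized_second_moment(quantize_z4, cell_volume=1.0)
g_d4 = normalized_second_moment(quantize_d4, cell_volume=2.0)
```

A lower normalized second moment means less distortion at the same point density, which is the sense in which D4 and E8 cells are "better" than the cube.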
KV cache: HyperQuant vs. TurboQuant / OCTOPUS
The KV-cache comparison is stricter than a simple quality table because the methods use protection tricks that change the effective compression rate. The paper matches the 32-token residual window used by OCTOPUS and reports true KV compression after accounting for protected tiles and the residual window. Under that matched setup, HyperQuant improves both axes at the key low-bit points: at 2 bps it brings the WikiText-2 perplexity increase down to +7.4% while reaching 6.4x KV compression, compared with OCTOPUS at +34.7% and 2.9x.
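To see why protection tricks change the effective rate, consider a deliberately simplified accounting in which the residual window and any protected values stay at the baseline precision and everything else is stored at the nominal rate. The function name, its parameters, and the 16-bit baseline are assumptions made for illustration. The paper's accounting includes details not modeled here, so this sketch is not expected to reproduce the reported KV compression figures.

```python
def effective_kv_compression(seq_len, bits_per_value, residual_window=0,
                             protected_fraction=0.0, baseline_bits=16.0):
    """Ratio of baseline storage to actual storage per cached channel.

    residual_window: most recent tokens kept at baseline precision.
    protected_fraction: share of the remaining values kept at baseline precision.
    """
    window = min(residual_window, seq_len)
    coded = seq_len - window
    protected = coded * protected_fraction
    quantized = coded - protected
    stored_bits = (window + protected) * baseline_bits + quantized * bits_per_value
    return seq_len * baseline_bits / stored_bits

nominal = effective_kv_compression(1024, 2.0)
with_window = effective_kv_compression(1024, 2.0, residual_window=32)
with_protection = effective_kv_compression(1024, 2.0, residual_window=32,
                                           protected_fraction=0.05)
```

In this toy model a nominal 8x at 2 bps falls as soon as any tokens or tiles are kept at full precision, and the penalty is larger for short sequences, where the window is a bigger share of the cache.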
| Bits | Codec | Correction | Residual window | WikiText-2 ΔPPL% ↓ | C4 ΔPPL% ↓ | KV compression (x) ↑ |
|---|---|---|---|---|---|---|
| 4 | TurboQuant-MSE | none | 32 | +3.1 | +1.7 | 2.2 |
| 4 | TurboQuant-QJL | qjl | 32 | +8.0 | +7.9 | 2.2 |
| 4 | OCTOPUS | none | 32 | +2.7 | +1.5 | 2.2 |
| 4 | OCTOPUS-QJL | qjl | 32 | +2.7 | +1.5 | 2.0 |
| 4 | HyperQuant | none | - | +0.8 | +1.0 | 3.7 |
| 4 | HyperQuant | qjl | - | +1.4 | +1.0 | 3.6 |
| 4 | HyperQuant | none | 32 | +0.1 | +0.2 | 3.6 |
| 4 | HyperQuant | qjl | 32 | +0.2 | +0.3 | 3.5 |
| 3 | TurboQuant-MSE | none | 32 | +8.6 | +8.3 | 2.6 |
| 3 | TurboQuant-QJL | qjl | 32 | +50.4 | +59.9 | 2.5 |
| 3 | OCTOPUS | none | 32 | +7.2 | +5.9 | 2.5 |
| 3 | OCTOPUS-QJL | qjl | 32 | +7.2 | +6.1 | 2.3 |
| 3 | HyperQuant | none | - | +5.5 | +5.7 | 4.8 |
| 3 | HyperQuant | qjl | - | +4.8 | +6.5 | 4.6 |
| 3 | HyperQuant | none | 32 | +1.8 | +1.4 | 4.6 |
| 3 | HyperQuant | qjl | 32 | +1.6 | +1.5 | 4.5 |
| 2 | TurboQuant-MSE | none | 32 | +63.0 | +77.4 | 3.0 |
| 2 | TurboQuant-QJL | qjl | 32 | +772.0 | +1349.0 | 3.0 |
| 2 | OCTOPUS | none | 32 | +34.7 | +41.5 | 2.9 |
| 2 | OCTOPUS-QJL | qjl | 32 | +34.7 | +41.4 | 2.6 |
| 2 | HyperQuant | none | - | +42.0 | +54.3 | 6.6 |
| 2 | HyperQuant | qjl | - | +44.0 | +53.7 | 6.4 |
| 2 | HyperQuant | none | 32 | +7.4 | +8.1 | 6.4 |
| 2 | HyperQuant | qjl | 32 | +14.7 | +15.2 | 6.1 |
| 1.7 | HyperQuant | none | 32 | +26.9 | +33.7 | 7.1 |
Video DiT: baseline vs. FP8 vs. INT8
This section checks whether the same weight-compression pipeline survives outside LLMs by applying it to LTX-2-19B, a 19B-parameter diffusion transformer for text-to-video. The paper quantizes all linear weights at 4 bps using FP8 or INT8 MMA paths while leaving the text encoder, VAE, and scheduler at bf16, then evaluates 32 prompts at 512x320 over 49 frames. INT8 is the cleaner quantized path in the reported metrics: it improves on FP8 in PSNR, SSIM, and LPIPS, and the visual comparison is there to make the numerical result inspectable frame by frame.