HyperQuant

A Rate–Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models


We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the linear layers and the KV cache of large language and diffusion transformers.

HyperQuant combines four well-studied techniques into a single construction: (i) a per-tile Walsh-Hadamard rotation that whitens weights and activations toward approximately multivariate-Gaussian statistics, (ii) quantization on a low-dimensional optimal lattice (E8, D4, A2, or Z), (iii) a near-entropy-optimal variable-length Rice coding scheme over the lattice indices, and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics.

How HyperQuant works

HyperQuant is built as one encode/decode pipeline that can serve two very different workloads: linear weights, which are compressed offline once, and KV cache vectors, which are produced online during generation. The common path first uses a randomized Hadamard transform to make each tile look closer to an isotropic Gaussian, then rescales it to a lattice operating point, quantizes it, strips the lattice code into compact integer symbols, and Rice-codes those symbols into a bitstream. On decode, the process runs backward into the low-precision MMA input; the KV-cache path adds bias-correction hooks, such as rotation and subtractive dither, because attention is especially sensitive to biased inner products.
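The rotate, rescale, and quantize steps of the common path can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: it assumes the simplest lattice (Z, i.e. plain rounding), a random sign flip as the randomization, and a hand-picked `step`. The names `fwht`, `encode_tile`, and `decode_tile` are hypothetical.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; the length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("tile length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def encode_tile(tile, signs, step):
    """Sign-randomized Hadamard rotation, then rounding to the Z lattice (assumed)."""
    rotated = fwht([s * x for s, x in zip(signs, tile)])
    return [round(v / step) for v in rotated]


def decode_tile(symbols, signs, step):
    """Invert encode_tile: rescale, undo the rotation, undo the sign flips."""
    rotated = [q * step for q in symbols]
    # The orthonormal Hadamard matrix is symmetric, so it is its own inverse.
    return [s * v for s, v in zip(signs, fwht(rotated))]


rng = random.Random(0)
tile = [rng.gauss(0.0, 1.0) for _ in range(16)]
tile[3] = 8.0  # an outlier that the rotation spreads across the whole tile
signs = [rng.choice((-1.0, 1.0)) for _ in tile]
step = 0.25  # hypothetical operating point

symbols = encode_tile(tile, signs, step)
restored = decode_tile(symbols, signs, step)
max_error = max(abs(a - b) for a, b in zip(tile, restored))
print(symbols, max_error)
```

Because the rotation is orthonormal, the per-coordinate rounding error of at most `step / 2` bounds the Euclidean reconstruction error by `sqrt(n) * step / 2` for a tile of length `n`.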

Recreated HyperQuant encode and decode flowchart.

Experiments and results

The experiments ask whether the same compression idea holds across three settings: Llama weight compression, online KV-cache compression, and a non-LLM stress test on the LTX-2 video DiT. For the language-model experiments, the paper reports pseudo-quantization perplexity on WikiText-2, with the KV comparison also matching the Qwen2.5-7B-Instruct-1M setup used by OCTOPUS. For video, the paper applies HyperQuant to the LTX-2-19B linear weights at 4 bits per scalar (bps) and reports image and video quality metrics over a 32-prompt suite, while noting that wall-clock acceleration still depends on a fused kernel implementation.

Setting	Rate	Criterion	HyperQuant	Reference
Weights + KV cache, INT8 MMA	4 bps	PPL ↓	7.50 (+4.7%)	7.16 (bf16)
Weights	4 bps	ΔPPL% ↓	+3.8%	+6.4% (HIGGS)
Weights	3 bps	ΔPPL% ↓	+22.1%	+33% (HIGGS)
KV cache	2 bps	ΔPPL% ↓	+7.4%	+34.7% (OCTOPUS)
KV cache	2 bps	Compression ↑	6.4x	3.0x (TurboQuant)
KV cache	1.7 bps	ΔPPL% ↓	+26.9%	-
LTX-2 video	4 bps	LPIPS ↓	0.20-0.21	0 (bf16)

Weights: perplexity vs. bitrate

This is the direct weight-only comparison against HIGGS on Llama-3.1-8B-Instruct evaluated on WikiText-2. HIGGS uses fixed-rate finite codebooks; HyperQuant uses a lattice plus variable-length Rice coding, so it can hit the same bps targets while spending fewer bits on likely symbols and allowing rare tail values to cost more. The important read is that every HyperQuant lattice beats HIGGS at the tested rates from 3 to 5 bps, with the advantage most visible in the more aggressive low-bit regime.
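To make the variable-length idea concrete, here is a small Rice coder over signed integer symbols. It is a sketch under assumptions, not the paper's bitstream format: the zigzag mapping for signed values, the string-of-bits representation, and the parameter `k` are all illustrative choices.

```python
def zigzag(value):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * value if value >= 0 else -2 * value - 1


def unzigzag(code):
    """Inverse of zigzag."""
    return code // 2 if code % 2 == 0 else -(code + 1) // 2


def rice_encode(values, k):
    """Per symbol: unary quotient, a '0' terminator, then k remainder bits."""
    bits = []
    for value in values:
        code = zigzag(value)
        quotient, remainder = code >> k, code & ((1 << k) - 1)
        bits.append("1" * quotient + "0")
        if k:
            bits.append(format(remainder, "0{}b".format(k)))
    return "".join(bits)


def rice_decode(bits, k, count):
    """Decode `count` symbols from a bit string produced by rice_encode."""
    values, pos = [], 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1  # skip the '0' terminator
        remainder = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((quotient << k) | remainder))
    return values


# Mostly small symbols with one rare tail value.
symbols = [0, 1, -1, 0, 2, 0, -3, 0, 0, 9]
stream = rice_encode(symbols, k=1)
assert rice_decode(stream, k=1, count=len(symbols)) == symbols
bits_per_symbol = len(stream) / len(symbols)
print(bits_per_symbol)
```

With `k=1`, a zero costs 2 bits while the tail value 9 costs 11 bits, which is the trade described above: frequent symbols are cheap, and no value is clipped because the code has no fixed upper bound.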

Llama-3.1-8B WikiText-2 perplexity versus bits per scalar for HyperQuant lattices and HIGGS.

Weights: why the gap appears

The paper explains the HIGGS gap through weight SNR, which is the quantity that best predicts perplexity damage for this kind of post-training weight quantization. HyperQuant gains from two places at once: Rice coding recovers entropy slack that fixed-rate codebooks leave on the table, and the unbounded lattice codebook avoids the tail-cell problem of finite codebooks. Higher-dimensional lattices such as D4 and E8 add another source of gain because their Voronoi cells have lower normalized second moments (less mean-squared error at the same cell volume), which is why the SNR curve stays above HIGGS across the range.
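As an illustration of what quantizing to a higher-dimensional lattice involves, the sketch below finds the nearest D4 point using the classic Conway and Sloane rounding rule. D4 is taken here in its standard form, the integer vectors in four dimensions whose coordinates sum to an even number. The function name `nearest_d4` is hypothetical, and the paper's actual kernels and scaling are not shown.

```python
import math


def nearest_d4(point):
    """Nearest point of D4 = {x in Z^4 : sum(x) is even} to a 4-vector."""
    rounded = [math.floor(x + 0.5) for x in point]
    if sum(rounded) % 2 == 0:
        return rounded
    # Parity is wrong: move the coordinate with the largest rounding error
    # to its second-nearest integer, which flips the parity at minimal cost.
    errors = [x - r for x, r in zip(point, rounded)]
    worst = max(range(len(point)), key=lambda i: abs(errors[i]))
    rounded[worst] += 1 if errors[worst] >= 0 else -1
    return rounded


point = [0.9, 0.4, 0.1, -0.2]
lattice_point = nearest_d4(point)  # plain rounding gives [1, 0, 0, 0], an odd sum
print(lattice_point)  # -> [1, 1, 0, 0]
```

For reference, the standard normalized second moments are about 0.0833 for Z, 0.0766 for D4, and 0.0717 for E8, so the lattice choice alone reduces distortion at a fixed cell volume.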

KV cache: HyperQuant vs. TurboQuant / OCTOPUS

The KV-cache comparison is stricter than a simple quality table because the methods use protection tricks that change the effective compression rate. The paper matches the 32-token residual window used by OCTOPUS and reports true KV compression after accounting for protected tiles and the residual window. Under that matched setup, HyperQuant improves both axes at the key low-bit points: at 2 bps it brings the WikiText-2 perplexity increase down to +7.4% while reaching 6.4x KV compression, compared with OCTOPUS at +34.7% and 2.9x.
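The effect of a residual window on the reported ratio can be seen with simple arithmetic. The helper below is a hypothetical accounting model, not the paper's exact formula: it assumes the most recent `residual_window` tokens stay at the 16-bit baseline, every older token is stored at the codec rate plus an optional per-scalar overhead (scales, protected tiles), and the context length is an arbitrary example value.

```python
def effective_kv_compression(num_tokens, bits_per_scalar, residual_window,
                             baseline_bits=16.0, overhead_bits_per_scalar=0.0):
    """Ratio of baseline KV size to stored size under the assumed accounting."""
    window = min(residual_window, num_tokens)
    compressed_tokens = num_tokens - window
    stored = (compressed_tokens * (bits_per_scalar + overhead_bits_per_scalar)
              + window * baseline_bits)
    return num_tokens * baseline_bits / stored


# Hypothetical 1024-token context at a nominal 2 bits per scalar.
no_window = effective_kv_compression(1024, 2.0, residual_window=0)
with_window = effective_kv_compression(1024, 2.0, residual_window=32)
print(no_window, with_window)
```

Under this model the uncompressed window always pulls the ratio below the nominal `baseline_bits / bits_per_scalar`, and the penalty shrinks as the context grows, which is why comparisons need a matched window and context.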

Bits	Codec	Correction	Residual window	WikiText-2 ΔPPL% ↓	C4 ΔPPL% ↓	KV compression (x) ↑
4	TurboQuant-MSE	none	32	+3.1	+1.7	2.2
	TurboQuant-QJL	qjl	32	+8.0	+7.9	2.2
	OCTOPUS	none	32	+2.7	+1.5	2.2
	OCTOPUS-QJL	qjl	32	+2.7	+1.5	2.0
	HyperQuant	none	-	+0.8	+1.0	3.7
	HyperQuant	qjl	-	+1.4	+1.0	3.6
	HyperQuant	none	32	+0.1	+0.2	3.6
	HyperQuant	qjl	32	+0.2	+0.3	3.5
3	TurboQuant-MSE	none	32	+8.6	+8.3	2.6
	TurboQuant-QJL	qjl	32	+50.4	+59.9	2.5
	OCTOPUS	none	32	+7.2	+5.9	2.5
	OCTOPUS-QJL	qjl	32	+7.2	+6.1	2.3
	HyperQuant	none	-	+5.5	+5.7	4.8
	HyperQuant	qjl	-	+4.8	+6.5	4.6
	HyperQuant	none	32	+1.8	+1.4	4.6
	HyperQuant	qjl	32	+1.6	+1.5	4.5
2	TurboQuant-MSE	none	32	+63.0	+77.4	3.0
	TurboQuant-QJL	qjl	32	+772.0	+1349.0	3.0
	OCTOPUS	none	32	+34.7	+41.5	2.9
	OCTOPUS-QJL	qjl	32	+34.7	+41.4	2.6
	HyperQuant	none	-	+42.0	+54.3	6.6
	HyperQuant	qjl	-	+44.0	+53.7	6.4
	HyperQuant	none	32	+7.4	+8.1	6.4
	HyperQuant	qjl	32	+14.7	+15.2	6.1
1.7	HyperQuant	none	32	+26.9	+33.7	7.1

Video DiT: baseline vs. FP8 vs. INT8

This section checks whether the same weight-compression pipeline survives outside LLMs by applying it to LTX-2-19B, a 19B-parameter diffusion transformer for text-to-video. The paper quantizes all linear weights at 4 bps using FP8 or INT8 MMA paths while leaving the text encoder, VAE, and scheduler at bf16, then evaluates 32 prompts at 512x320 over 49 frames. INT8 is the cleaner quantized path in the reported metrics: it improves on FP8 in PSNR, SSIM, and LPIPS, and the visual comparison is there to make the numerical result inspectable frame by frame.

Visual comparison clips (prompt, index): Volcanic slope (idx 23); Starlings (idx 11); Robotic arms (idx 31); Tokyo crossing (idx 30).