LiteFFN: How We Made LTX-2 19B Run Faster with Low-Rank Magic
TL;DR
We implemented an efficient technique to accelerate the feed-forward networks (FFN) in LTX-2, Lightricks' 19-billion-parameter video generation model. The result: an 11.5% peak memory reduction and 22.5% faster transformer computation, yielding 7.6% faster end-to-end inference with minimal quality loss.
The key insight: as attention mechanisms get faster (thanks to Flash Attention), FFN layers become the new bottleneck, and we can compress them using low-rank decomposition plus quantization.
This work builds on the theoretical foundation of SVDQuant, with practical implementation for H100 GPUs and real-world video generation workloads. LiteFFN repository: https://github.com/moonmath-ai/LiteFFN.
The Problem: Attention Got Fast, FFN Didn't
If you've been following the AI video generation space, you've probably heard of models like LTX-2 and OpenAI's Sora. These models can generate stunning video from text descriptions, but they are computationally expensive.
LTX-2 is a 19-billion parameter Diffusion Transformer that can generate native 4K video without upscaling, up to 50 FPS, synchronized audio in the same forward pass, and up to 20 seconds of high-fidelity video.
That power comes at a cost: even on an H100 GPU, generation can take [X] seconds and require [X] GB of memory (or 12 GB minimum with FP8 quantization). This is manageable for research, but a major blocker for scaled production deployment.
Where Does the Time Go?
Transformer-based models have two main computational blocks:
- Attention layers: capture relationships between different parts of the input
- Feed-forward networks (FFN): process each position independently
The community has heavily optimized attention (Flash Attention 1-3, Sage Attention, LiteAttention), but FFN layers have been far less optimized.
As attention gets faster, FFN becomes the bottleneck. On LTX-2 inference, FFN is nearly 50% of transformer runtime, and this is the gap we targeted.
The Solution: Compress the FFN Smartly
Our approach combines two ideas:
- Low-rank decomposition: approximate large weight matrices with smaller ones
- Quantization: store the remaining error in fewer bits
The Intuition
In LTX-2, FFN matrices can be as large as 8192x2048 with tens of millions of parameters. Not all parameters carry equal signal, and many capture overlapping structure.
Low-rank decomposition represents one large matrix as the product of two smaller matrices, preserving most useful behavior at much lower compute and memory cost. Quantization then captures the leftover correction terms efficiently.[1]
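To make the savings concrete, here is a back-of-the-envelope comparison for a single projection. The 8192x2048 shape comes from the post; rank 64 matches one configuration benchmarked below, but the exact rank per layer is a tunable choice.

```python
# Parameter count for one FFN projection: dense matrix vs. two thin
# low-rank factors A (d_out x r) and B (r x d_in). Illustrative only;
# the quantized residual path adds back (cheap, low-bit) parameters.
d_out, d_in, r = 8192, 2048, 64

dense_params = d_out * d_in            # full-precision matrix
lowrank_params = d_out * r + r * d_in  # the two thin factors

print(dense_params)    # -> 16777216
print(lowrank_params)  # -> 655360 (about 3.9% of dense)
```

The matmul FLOPs shrink by the same ratio, which is why the low-rank path can run in full precision while staying cheap.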
How It Works
Step 1: Collect Calibration Data
We run the model on a sample of real prompts and collect per-layer input statistics to characterize typical activation structure.
For each FFN layer, we compute:
R = E[x * x^T]
For 2048-dimensional inputs, each autocorrelation matrix has roughly 4 million entries. Across many layers, memory grows quickly, so we compute R incrementally:
```python
# Pseudocode for incremental autocorrelation
R_sum = zeros((dim, dim))
N = 0
for batch in calibration_data:
    x = get_activations(batch)    # shape: [batch, seq_len, dim]
    x_flat = x.reshape(-1, dim)   # flatten to [N_samples, dim]
    R_sum += x_flat.T @ x_flat    # accumulate outer products
    N += x_flat.shape[0]
R = R_sum / N
```
Step 2: Compute Effective Weights
We use calibration statistics to transform the original weight matrix into a form that is easier to compress:
W_effective = W @ R^(1/2)
This reweights columns by practical usage, improving how efficiently SVD can approximate the matrix for real workloads.
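Since R is symmetric positive semi-definite, its square root can be taken through an eigendecomposition. A minimal NumPy sketch, assuming a hypothetical helper `sqrtm_psd` (the production code may compute this differently):

```python
import numpy as np

def sqrtm_psd(R, eps=1e-12):
    """Symmetric square root of a PSD matrix:
    R = Q diag(lam) Q^T  =>  R^(1/2) = Q diag(sqrt(lam)) Q^T."""
    lam, Q = np.linalg.eigh(R)
    lam = np.clip(lam, eps, None)   # guard tiny negative eigenvalues from numerics
    return (Q * np.sqrt(lam)) @ Q.T

# Toy calibration statistics: R = E[x x^T] over 1000 samples of dim 16.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))
R = X.T @ X / len(X)

R_half = sqrtm_psd(R)
print(np.allclose(R_half @ R_half, R, atol=1e-8))  # -> True
```

`W_effective = W @ R_half` then scales each input dimension of W by how strongly real activations excite it.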
Step 3: Decompose
W_effective = U @ S @ V^T
We keep the top-r singular components (writing Vt for V^T):
W_low_rank_eff = U[:, :r] @ diag(S[:r]) @ Vt[:r, :]
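A small NumPy sketch of the truncation, on a toy matrix standing in for `W @ R^(1/2)`. Folding the singular values into one factor leaves exactly the two thin matrices the inference path needs (the names A and B are mine, not from the codebase):

```python
import numpy as np

rng = np.random.default_rng(1)
W_eff = rng.standard_normal((64, 32))  # stand-in for W @ R^(1/2)
r = 8

# Truncated SVD: keep only the top-r singular components.
U, S, Vt = np.linalg.svd(W_eff, full_matrices=False)
W_low_rank_eff = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Fold singular values into the left factor so inference needs just
# A (d_out x r) and B (r x d_in).
A = U[:, :r] * S[:r]
B = Vt[:r, :]
print(np.allclose(A @ B, W_low_rank_eff))  # -> True
```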
Step 4: Handle the Remainder
Remainder = W - W_low_rank
Here W_low_rank is the truncated approximation mapped back to the original input space, W_low_rank = W_low_rank_eff @ R^(-1/2). This residual is quantized to a 4-bit or 8-bit format.
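As an illustration of residual quantization, here is a symmetric per-output-channel int8 scheme. This is a hedged sketch only: the production kernels use FP8 E4M3 (and, going forward, FP4) rather than int8, and the helper names are mine.

```python
import numpy as np

def quantize_rowwise_int8(residual):
    """Symmetric per-row int8 quantization with one scale per output channel."""
    scale = np.abs(residual).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on empty rows
    q = np.round(residual / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
res = rng.standard_normal((16, 32)).astype(np.float32)
q, s = quantize_rowwise_int8(res)

# Rounding error is bounded by half a quantization step per row.
err = np.abs(dequantize(q, s) - res).max()
print(bool(err <= s.max() / 2 + 1e-6))  # -> True
```

Because the low-rank path already absorbed the dominant singular directions, the residual has a much smaller dynamic range than W itself, which is what makes aggressive quantization tolerable.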
Step 5: Inference
```python
def forward(x):
    # Low-rank path (full precision, small matrices)
    y_lowrank = A @ (B @ x)
    # Remainder path (quantized)
    y_remainder = Q_quantized @ x
    return y_lowrank + y_remainder + bias
```
Quantization Options
Current production format on Hopper:
FP8 E4M3
- Bits: 8 (4 exponent, 3 mantissa)
- Range: approximately +/-448
- Pros: native H100 tensor core support, strong accuracy
- Cons: 2x memory versus FP4
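For the curious, the ~448 range follows directly from the E4M3 encoding: the largest finite value has the maximum biased exponent 1111 (15, bias 7) and mantissa 110 (the all-ones mantissa encodes NaN in this format), i.e. 1.75 * 2^8:

```python
# Largest finite FP8 E4M3 value, derived from the bit layout.
bias = 7
max_exponent = 0b1111 - bias  # 15 - 7 = 8
max_mantissa = 1 + 6 / 8      # binary 1.110 = 1.75 (1.111 is NaN)
print(max_mantissa * 2 ** max_exponent)  # -> 448.0
```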
Planned formats for LiteFFN, targeting Blackwell natively (with Hopper support to follow), include NVFP4 E2M1 and MXFP4.
The Fun Part: Multiply-Free FP4
On hardware without native FP4 support, NVFP4's limited value set lets us replace floating-point multiplies with shift/add operations.
| FP4 Value | Operation | Instructions |
|---|---|---|
| 0 | return 0 | none |
| 0.5 | x >> 1 | 1 shift |
| 1 | x | identity |
| 1.5 | x + (x >> 1) | 1 shift, 1 add |
| 2 | x << 1 | 1 shift |
| 3 | (x << 1) + x | 1 shift, 1 add |
| 4 | x << 2 | 1 shift |
| 6 | (x << 2) + (x << 1) | 2 shifts, 1 add |
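The table above can be sketched as a dispatch on the FP4 magnitude, operating on fixed-point integer activations. A real kernel would express this as a branch or lookup in CUDA; here a Python dict of lambdas stands in for that dispatch:

```python
# Multiply-free FP4: each nonnegative E2M1 magnitude maps to shifts/adds.
FP4_MUL = {
    0.0: lambda x: 0,
    0.5: lambda x: x >> 1,
    1.0: lambda x: x,
    1.5: lambda x: x + (x >> 1),
    2.0: lambda x: x << 1,
    3.0: lambda x: (x << 1) + x,
    4.0: lambda x: x << 2,
    6.0: lambda x: (x << 2) + (x << 1),
}

x = 40  # example fixed-point activation
print([FP4_MUL[v](x) for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)])
# -> [0, 20, 40, 60, 80, 120, 160, 240]
```

Signs are handled separately (a sign bit flips the result), so eight shift/add cases cover the full NVFP4 value set.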
Results
Average Runtime
| Group | Transformer Mean (s) | Min (s) | Max (s) | Std (s) | Transformer % Faster | Decode Mean (s) | Save (s) | E2E Total (s) | E2E % Faster |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 4.520 | 4.460 | 4.650 | 0.070 | 0.00% | 3.710 | 5.100 | 13.330 | 0.00% |
| liteffn | 3.500 | 3.490 | 3.520 | 0.010 | 22.57% | 3.710 | 5.100 | 12.310 | 7.65% |
Allocated VRAM Reduction
| Configuration | Average | Peak | Peak Relative to Baseline |
|---|---|---|---|
| Original (FP16) | 59,919.00 MB | 65,663.00 MB | 100% |
| r=64 | 44,517.00 MB | 58,433.00 MB | 89.0% |
Video Samples (baseline vs LiteFFN r=32 / r=64 / r=512)
Quality Measurements
PSNR averaged over different seeds, per prompt (dB)[2]. Higher is better; 20 dB corresponds to a mean squared error of roughly 10^-2:
| Prompt | baseline vs r=32 | baseline vs r=64 | baseline vs r=512 |
|---|---|---|---|
| a-dramatic-underwater-scene-featuring-a-person-s | 19.823 | 19.822 | 20.413 |
| a-man-in-a-sleek-modern-jetpack-flying-upwards-t | 20.872 | 21.517 | 21.163 |
| a-serene-view-of-the-banks-of-the-rhine-river-sh | 20.277 | 20.208 | 19.735 |
| a-single-water-droplet-falls-from-a-height-movin | 27.827 | 27.375 | 30.680 |
| two-anthropomorphic-cats-boxing-in-a-well-lit-ar | 18.646 | 20.924 | 21.188 |
Prompt PSNR Summary
| Prompt | Total seeds | PSNR > 20 dB | PSNR > 17 dB |
|---|---|---|---|
| a-dramatic-underwater-scene-featuring-a-person-s | 30 | 10 | 30 |
| a-man-in-a-sleek-modern-jetpack-flying-upwards-t | 30 | 30 | 30 |
| a-serene-view-of-the-banks-of-the-rhine-river-sh | 30 | 20 | 30 |
| a-single-water-droplet-falls-from-a-height-movin | 30 | 30 | 30 |
| two-anthropomorphic-cats-boxing-in-a-well-lit-ar | 30 | 20 | 30 |
Performance Benchmark on Shapes Captured from LTX-Video
Multiplier columns are speedup ratios versus the baseline linear layer (>1 is faster, <1 is slower).
Units:
- Per-shape rows: microseconds (us)
- TOTAL row: milliseconds (ms)
Column glossary:
- Cfg: FFN projection shape (w1 = up-proj, w2 = down-proj)
- M: flattened activation rows for that GEMM shape
- Count: number of calls for that shape in the captured workload
- Lin: baseline nn.Linear latency
- TE: Transformer Engine linear latency
- PT: LiteFFN PyTorch path latency
- CUDA: LiteFFN CUDA path latency
- TE_x / PT_x / CUDA_x: speedup multiplier vs baseline linear
- TOTAL: count-weighted aggregate across listed shapes
| Cfg | M | Count | Lin | TE | PT | CUDA | TE_x | PT_x | CUDA_x |
|---|---|---|---|---|---|---|---|---|---|
| w2 | 1400 | 336 | 392 | 260 | 298 | 186 | 1.508x | 1.315x | 2.108x |
| w1 | 1400 | 336 | 378 | 273 | 301 | 182 | 1.385x | 1.256x | 2.077x |
| w2 | 2450 | 336 | 597 | 437 | 501 | 317 | 1.366x | 1.192x | 1.883x |
| w1 | 2450 | 336 | 562 | 455 | 511 | 302 | 1.235x | 1.100x | 1.861x |
| w2 | 5600 | 480 | 1213 | 994 | 1215 | 762 | 1.220x | 0.998x | 1.592x |
| w1 | 5600 | 480 | 1141 | 1097 | 1198 | 712 | 1.040x | 0.952x | 1.603x |
| w2 | 9800 | 144 | 2008 | 1763 | 2074 | 1261 | 1.139x | 0.968x | 1.592x |
| w1 | 9800 | 144 | 1946 | 1868 | 2042 | 1202 | 1.042x | 0.953x | 1.619x |
| w2 | 10850 | 336 | 2277 | 2110 | 2281 | 1395 | 1.079x | 0.998x | 1.632x |
| w1 | 10850 | 336 | 2035 | 2156 | 2215 | 1324 | 0.944x | 0.919x | 1.537x |
| w2 | 22400 | 144 | 4889 | 4506 | 4596 | 3018 | 1.085x | 1.064x | 1.620x |
| w1 | 22400 | 144 | 4714 | 4581 | 4471 | 2805 | 1.029x | 1.054x | 1.681x |
| w2 | 43400 | 144 | 9275 | 8285 | 8845 | 5772 | 1.119x | 1.049x | 1.607x |
| w1 | 43400 | 144 | 8847 | 8460 | 8669 | 5328 | 1.046x | 1.021x | 1.660x |
| TOTAL | - | 3840 | 7788 | 7158 | 7631 | 4744 | 1.088x | 1.021x | 1.642x |
Future Work
- Attention projection decomposition (Q/K/V/O)
- Adaptive per-layer rank selection from singular value decay
- Dynamic rank by denoising timestep
- More optimized FP4 kernels with warp-level tuning
- Extension to other video models, world models, and VLMs
Acknowledgments
This work would not exist without SVDQuant. We also thank the Lightricks team for open-sourcing LTX-2 and the Flash Attention authors for pushing efficient attention forward.
References
- LiteFFN repository: https://github.com/moonmath-ai/LiteFFN
- SVDQuant Paper: https://arxiv.org/abs/2411.05007
- LTX-2: https://huggingface.co/Lightricks/LTX-2