LiteFFN: How We Made LTX-2 19B Run Faster with Low-Rank Magic

LiteFFN post header image

TL;DR

We implemented an efficient technique to accelerate the feed-forward networks (FFN) in LTX-2, Lightricks' 19-billion-parameter video generation model. The result: an 11.5% peak memory reduction and 22.5% faster transformer computation, which translates into 7.6% faster end-to-end inference with minimal quality loss.

The key insight: as attention mechanisms get faster (thanks to Flash Attention), FFN layers become the new bottleneck, and we can compress them using low-rank decomposition plus quantization.

This work builds on the theoretical foundation of SVDQuant, with practical implementation for H100 GPUs and real-world video generation workloads. LiteFFN repository: https://github.com/moonmath-ai/LiteFFN.

The Problem: Attention Got Fast, FFN Didn't

If you've been following the AI video generation space, you've probably heard of models like LTX-2 and OpenAI's Sora. These models can generate stunning video from text descriptions, but they are computationally expensive.

LTX-2 is a 19-billion parameter Diffusion Transformer that can generate native 4K video without upscaling, up to 50 FPS, synchronized audio in the same forward pass, and up to 20 seconds of high-fidelity video.

That power comes at a cost: even on an H100 GPU, generation can take [X] seconds and requires [X] GB of memory (or 12 GB minimum with FP8 quantization). This is manageable for research, but a major blocker for scaled production deployment.

Where Does the Time Go?

Transformer-based models have two main computational blocks:

  • Attention layers: capture relationships between different parts of the input
  • Feed-forward networks (FFN): process each position independently

The community has heavily optimized attention (Flash Attention 1-3, Sage Attention, LiteAttention), but FFN layers have been far less optimized.

As attention gets faster, FFN becomes the bottleneck. On LTX-2 inference, FFN is nearly 50% of transformer runtime, and this is the gap we targeted.

The Solution: Compress the FFN Smartly

Our approach combines two ideas:

  • Low-rank decomposition: approximate large weight matrices with smaller ones
  • Quantization: store the remaining error in fewer bits

The Intuition

In LTX-2, FFN matrices can be as large as 8192x2048 with tens of millions of parameters. Not all parameters carry equal signal, and many capture overlapping structure.

Low-rank decomposition represents one large matrix as the product of two smaller matrices, preserving most useful behavior at much lower compute and memory cost. Quantization then captures the leftover correction terms efficiently.[1]
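To make the savings concrete, here is a small sketch (our own illustration, not code from the LiteFFN repo) of the parameter count for a rank-r factorization at the 8192x2048 shape mentioned above, plus a toy truncated SVD:

```python
import numpy as np

# Parameter count: a rank-r factorization of an [out, in] matrix replaces
# out*in weights with r*(out + in) weights across the two factors.
out_dim, in_dim, r = 8192, 2048, 64
full_params = out_dim * in_dim            # 16,777,216
lowrank_params = r * (out_dim + in_dim)   # 655,360 -> ~25x fewer

# Behavior: for a matrix that truly is low-rank, truncated SVD recovers it
# exactly. Real weight matrices are only approximately low-rank, which is
# why the leftover residual still needs to be captured (via quantization).
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 128)) @ rng.standard_normal((128, 256))  # rank <= 128
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_r = (U[:, :128] * S[:128]) @ Vt[:128, :]
assert np.allclose(W_r, W, atol=1e-6)
```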

How It Works

Step 1: Collect Calibration Data

We run the model on a sample of real prompts and collect per-layer input statistics to characterize typical activation structure.

For each FFN layer, we compute:

R = E[x * x^T]

For 2048-dimensional inputs, each autocorrelation matrix has about 4.2 million entries (2048^2). Across many layers, memory grows quickly, so we compute R incrementally:

# Incremental accumulation of R = E[x x^T] over calibration batches
import numpy as np

dim = 2048
R_sum = np.zeros((dim, dim))
N = 0

for batch in calibration_data:
    x = get_activations(batch)   # shape: [batch, seq_len, dim]
    x_flat = x.reshape(-1, dim)  # flatten to [n_samples, dim]
    R_sum += x_flat.T @ x_flat   # accumulate outer products
    N += x_flat.shape[0]         # count tokens seen so far

R = R_sum / N                    # average -> E[x x^T]

Step 2: Compute Effective Weights

We use calibration statistics to transform the original weight matrix into a form that is easier to compress:

W_effective = W @ R^(1/2)

This reweights columns by how heavily real activations exercise them, so the subsequent SVD spends its rank budget on the directions that matter for actual workloads.
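Since R is an average of outer products x x^T, it is symmetric positive semi-definite, so its square root can be computed via an eigendecomposition. A minimal sketch (our naming, not the repo's API):

```python
import numpy as np

def matrix_sqrt(R, eps=1e-8):
    """Symmetric PSD square root via eigendecomposition.

    R is symmetric PSD by construction (an average of outer products),
    so eigh applies and the eigenvalues are (numerically near) non-negative.
    """
    eigvals, eigvecs = np.linalg.eigh(R)
    eigvals = np.clip(eigvals, eps, None)   # guard tiny negatives from round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

# Quick check on a synthetic autocorrelation matrix
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 16))
R = x.T @ x / len(x)
R_half = matrix_sqrt(R)
assert np.allclose(R_half @ R_half, R, atol=1e-6)
```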

Step 3: Decompose

W_effective = U @ S @ V^T

We keep the top-r singular components:

W_low_rank_eff = U[:, :r] @ diag(S[:r]) @ V^T[:r, :]

Mapping back through the inverse square root gives the low-rank approximation of W itself, in the original input space:

W_low_rank = W_low_rank_eff @ R^(-1/2)

Step 4: Handle the Remainder

Remainder = W - W_low_rank

This residual is quantized to 4-bit or 8-bit formats.
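Steps 2 through 4 can be sketched end-to-end in NumPy (function and variable names are ours, not the repo's; the remainder is left unquantized here for clarity):

```python
import numpy as np

def decompose(W, R, r, eps=1e-8):
    """Split W into a rank-r part (informed by input statistics R)
    plus a dense remainder. Illustrative sketch, not the repo's API."""
    # Whitening transform from the calibration autocorrelation
    eigvals, eigvecs = np.linalg.eigh(R)
    eigvals = np.clip(eigvals, eps, None)
    R_half = (eigvecs * np.sqrt(eigvals)) @ eigvecs.T
    R_inv_half = (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

    # SVD of the effective (whitened) weights; keep the top-r components
    U, S, Vt = np.linalg.svd(W @ R_half, full_matrices=False)
    A = U[:, :r] * S[:r]          # [out, r], singular values folded in
    B = Vt[:r, :] @ R_inv_half    # [r, in], back in the original input space

    remainder = W - A @ B         # what quantization must capture
    return A, B, remainder

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal((1000, 32))
R = x.T @ x / len(x)
A, B, rem = decompose(W, R, r=8)
# Low-rank path plus remainder reconstructs W exactly (before quantization)
assert np.allclose(A @ B + rem, W, atol=1e-6)
```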

Step 5: Inference

def forward(x):
    # Low-rank path: two small full-precision matmuls
    # (B projects dim -> r, A projects r -> dim_out)
    y_lowrank = A @ (B @ x)

    # Remainder path: one matmul with the quantized residual weights
    y_remainder = Q_quantized @ x

    return y_lowrank + y_remainder + bias
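With the remainder kept in full precision, the two-path forward reproduces the dense layer exactly; quantizing the remainder is what introduces the (small) approximation error. A self-contained sketch, using a plain SVD split for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_out, r = 32, 64, 8

W = rng.standard_normal((dim_out, dim_in))
bias = rng.standard_normal(dim_out)

# Plain (calibration-free) SVD split, for illustration only
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # [dim_out, r]
B = Vt[:r, :]          # [r, dim_in]
Q = W - A @ B          # remainder (would be quantized in LiteFFN)

def forward(x):
    # Low-rank path + remainder path, as in the pseudocode above
    return A @ (B @ x) + Q @ x + bias

x = rng.standard_normal(dim_in)
dense = W @ x + bias
assert np.allclose(forward(x), dense, atol=1e-6)
```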

Quantization Options

Current production format on Hopper:

FP8 E4M3

  • Bits: 8 (4 exponent, 3 mantissa)
  • Range: approximately +/-448
  • Pros: native H100 tensor core support, strong accuracy
  • Cons: 2x memory versus FP4
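The +/-448 range can be checked by enumerating the encoding directly. A quick sketch of the E4M3 "FN" (finite-only) variant used by H100 tensor cores, which trades infinities for extra range:

```python
# Enumerate finite E4M3FN magnitudes: 4 exponent bits (bias 7), 3 mantissa bits.
# In the FN encoding, exponent=0b1111 with mantissa=0b111 is NaN, so the
# largest finite value is 1.75 * 2^8 = 448.
values = set()
for e in range(16):                         # exponent field
    for m in range(8):                      # mantissa field
        if e == 15 and m == 7:
            continue                        # NaN encoding
        if e == 0:
            v = (m / 8) * 2 ** (1 - 7)      # subnormals
        else:
            v = (1 + m / 8) * 2 ** (e - 7)  # normals
        values.add(v)

print(max(values))  # 448.0
```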

Planned formats for LiteFFN on Blackwell and future Hopper support include NVFP4 E2M1 and MXFP4.

The Fun Part: Multiply-Free FP4

On hardware without native FP4 support, NVFP4's limited value set lets us replace floating-point multiplies with shift/add operations.

| FP4 Value | Operation | Instructions |
|-----------|-----------------------|-----------------|
| 0 | return 0 | none |
| 0.5 | x >> 1 | 1 shift |
| 1 | x | identity |
| 1.5 | x + (x >> 1) | 1 shift, 1 add |
| 2 | x << 1 | 1 shift |
| 3 | (x << 1) + x | 1 shift, 1 add |
| 4 | x << 2 | 1 shift |
| 6 | (x << 2) + (x << 1) | 2 shifts, 1 add |
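The table above translates directly into code. A sketch of the dispatch (integer inputs, e.g. fixed-point activations; the >>1 cases truncate the low bit on odd inputs):

```python
# Shift/add emulation of the NVFP4 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
def mul_fp4(x: int, v: float) -> int:
    ops = {
        0:   lambda x: 0,
        0.5: lambda x: x >> 1,
        1:   lambda x: x,
        1.5: lambda x: x + (x >> 1),
        2:   lambda x: x << 1,
        3:   lambda x: (x << 1) + x,
        4:   lambda x: x << 2,
        6:   lambda x: (x << 2) + (x << 1),
    }
    return ops[v](x)

# For even inputs every product is exact:
assert all(mul_fp4(12, v) == int(12 * v) for v in (0, 0.5, 1, 1.5, 2, 3, 4, 6))
```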

Results

Average Runtime

| Group | Transformer Mean (s) | Min (s) | Max (s) | Std (s) | Transformer % Faster | Decode Mean (s) | Save (s) | E2E Total (s) | E2E % Faster |
|----------|-------|-------|-------|-------|--------|-------|-------|--------|-------|
| baseline | 4.520 | 4.460 | 4.650 | 0.070 | 0.00% | 3.710 | 5.100 | 13.330 | 0.00% |
| liteffn | 3.500 | 3.490 | 3.520 | 0.010 | 22.57% | 3.710 | 5.100 | 12.310 | 7.65% |

Allocated VRAM Reduction

| Configuration | Average (MB) | Peak (MB) | Peak Relative to Baseline |
|-----------------|-----------|-----------|------|
| Original (FP16) | 59,919.00 | 65,663.00 | 100% |
| r=64 | 44,517.00 | 58,433.00 | 89.0% |
LiteFFN quality comparison figure

Video Samples (baseline vs LiteFFN r=32 / r=64 / r=512)

Click a thumbnail to play the video.

| Baseline | LiteFFN r=32 | LiteFFN r=64 | LiteFFN r=512 |
|----------|--------------|--------------|---------------|
| Baseline – water droplet | r=32 – water droplet | r=64 – water droplet | r=512 – water droplet |
| Baseline – jetpack | r=32 – jetpack | r=64 – jetpack | r=512 – jetpack |
| Baseline – cats boxing | r=32 – cats boxing | r=64 – cats boxing | r=512 – cats boxing |
| Baseline – Rhine river | r=32 – Rhine river | r=64 – Rhine river | r=512 – Rhine river |
| Baseline – underwater | r=32 – underwater | r=64 – underwater | r=512 – underwater |

Quality Measurements

PSNR averaged over different seeds, per prompt (dB)[2]. Higher is better; 20 dB corresponds to a mean squared error of roughly 10^-2 on a unit-range signal:

| Prompt | baseline vs r=32 | baseline vs r=64 | baseline vs r=512 |
|--------|--------|--------|--------|
| a-dramatic-underwater-scene-featuring-a-person-s | 19.823 | 19.822 | 20.413 |
| a-man-in-a-sleek-modern-jetpack-flying-upwards-t | 20.872 | 21.517 | 21.163 |
| a-serene-view-of-the-banks-of-the-rhine-river-sh | 20.277 | 20.208 | 19.735 |
| a-single-water-droplet-falls-from-a-height-movin | 27.827 | 27.375 | 30.680 |
| two-anthropomorphic-cats-boxing-in-a-well-lit-ar | 18.646 | 20.924 | 21.188 |
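The 20 dB to 10^-2 MSE correspondence noted above follows directly from the PSNR definition, assuming a unit peak signal:

```python
import math

def psnr(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10 * math.log10(peak ** 2 / mse)

# Sanity check: MSE of 1e-2 on a unit-range signal is exactly 20 dB.
assert abs(psnr(1e-2) - 20.0) < 1e-9
```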

Prompt PSNR Summary

| Prompt | Total seeds | PSNR > 20 dB | PSNR > 17 dB |
|--------|----|----|----|
| a-dramatic-underwater-scene-featuring-a-person-s | 30 | 10 | 30 |
| a-man-in-a-sleek-modern-jetpack-flying-upwards-t | 30 | 30 | 30 |
| a-serene-view-of-the-banks-of-the-rhine-river-sh | 30 | 20 | 30 |
| a-single-water-droplet-falls-from-a-height-movin | 30 | 30 | 30 |
| two-anthropomorphic-cats-boxing-in-a-well-lit-ar | 30 | 20 | 30 |

Performance Benchmark on Shapes Captured from LTX-Video

Multiplier columns are speedup ratios vs baseline linear (>1 faster, <1 slower).

Units:

  • per-shape rows: us
  • TOTAL row: ms

Column glossary:

  • Cfg: FFN projection shape (w1 = up-proj, w2 = down-proj).
  • M: flattened activation rows for that GEMM shape.
  • Count: number of calls for that shape in the captured workload.
  • Lin: baseline nn.Linear latency.
  • TE: Transformer Engine linear latency.
  • PT: LiteFFN PyTorch path latency.
  • CUDA: LiteFFN CUDA path latency.
  • TE_x / PT_x / CUDA_x: speedup multiplier vs baseline linear (>1 is faster, <1 is slower).
  • TOTAL: count-weighted aggregate across listed shapes.
LiteFFN benchmark plot[3]
| Cfg | M | Count | Lin | TE | PT | CUDA | TE_x | PT_x | CUDA_x |
|-------|-------|------|------|------|------|------|--------|--------|--------|
| w2 | 1400 | 336 | 392 | 260 | 298 | 186 | 1.508x | 1.315x | 2.108x |
| w1 | 1400 | 336 | 378 | 273 | 301 | 182 | 1.385x | 1.256x | 2.077x |
| w2 | 2450 | 336 | 597 | 437 | 501 | 317 | 1.366x | 1.192x | 1.883x |
| w1 | 2450 | 336 | 562 | 455 | 511 | 302 | 1.235x | 1.100x | 1.861x |
| w2 | 5600 | 480 | 1213 | 994 | 1215 | 762 | 1.220x | 0.998x | 1.592x |
| w1 | 5600 | 480 | 1141 | 1097 | 1198 | 712 | 1.040x | 0.952x | 1.603x |
| w2 | 9800 | 144 | 2008 | 1763 | 2074 | 1261 | 1.139x | 0.968x | 1.592x |
| w1 | 9800 | 144 | 1946 | 1868 | 2042 | 1202 | 1.042x | 0.953x | 1.619x |
| w2 | 10850 | 336 | 2277 | 2110 | 2281 | 1395 | 1.079x | 0.998x | 1.632x |
| w1 | 10850 | 336 | 2035 | 2156 | 2215 | 1324 | 0.944x | 0.919x | 1.537x |
| w2 | 22400 | 144 | 4889 | 4506 | 4596 | 3018 | 1.085x | 1.064x | 1.620x |
| w1 | 22400 | 144 | 4714 | 4581 | 4471 | 2805 | 1.029x | 1.054x | 1.681x |
| w2 | 43400 | 144 | 9275 | 8285 | 8845 | 5772 | 1.119x | 1.049x | 1.607x |
| w1 | 43400 | 144 | 8847 | 8460 | 8669 | 5328 | 1.046x | 1.021x | 1.660x |
| TOTAL | - | 3840 | 7788 | 7158 | 7631 | 4744 | 1.088x | 1.021x | 1.642x |

Future Work

  • Attention projection decomposition (Q/K/V/O)
  • Adaptive per-layer rank selection from singular value decay
  • Dynamic rank by denoising timestep
  • More optimized FP4 kernels with warp-level tuning
  • Extension to other video models, world models, and VLMs

Acknowledgments

This work would not exist without SVDQuant. We also thank the Lightricks team for open-sourcing LTX-2 and the Flash Attention authors for pushing efficient attention forward.

References