LiteAttention: A Faster Alternative to FlashAttention3 for Diffusion Transformers
TL;DR
If you are running diffusion transformers on H100 or H200 and using FlashAttention3, LiteAttention can reduce forward-pass runtime by up to 2x, without degrading visual quality.
It does this by identifying attention tiles that do not meaningfully contribute to the output, skipping them, and propagating those skip decisions across diffusion timesteps.
It is a drop-in replacement. You can use it today: https://github.com/moonmath-ai/LiteAttention.
FlashAttention3 is the industry standard for high-performance attention on Nvidia Hopper GPUs. We benchmarked it on diffusion transformer workloads with a simple goal: reduce forward-pass runtime.
FlashAttention3 is optimized to compute the full attention matrix efficiently. But when we profiled diffusion workloads carefully, we observed an opportunity.
In diffusion transformers, not all attention tiles contribute equally at every timestep. If insignificant tiles could be identified early and skipped consistently, the forward pass could be meaningfully accelerated.
That observation led to LiteAttention.
Read the full story here: LiteAttention repository.
What Changed
Instead of making each attention tile faster, LiteAttention identifies tiles whose contribution to the output is negligible.
If a tile's impact is small enough, we:
- Skip its QK multiplication
- Skip its softmax
- Skip its PV multiplication
And crucially, we propagate that skip decision across future diffusion timesteps.
This converts sparsity into real wall-clock gains.
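The mechanism can be sketched in plain Python. Everything here is illustrative (the names `THRESHOLD`, `compute_tile`, and the dict-based skip map are our stand-ins, not the kernel's actual data structures); in LiteAttention the decision and the skipping happen inside the fused CUDA kernel.

```python
# Sketch of threshold-based tile skipping with temporal propagation.
# Names and the threshold value are illustrative assumptions.

THRESHOLD = 1e-3

def compute_tile(tile):
    # Stand-in for a tile's attention contribution (QK -> softmax -> PV).
    return tile

def attention_with_skips(tiles, skip_map):
    """Compute attention over tiles, skipping those marked in skip_map.

    Returns the output and an updated skip map that is carried
    forward to the next diffusion timestep.
    """
    output = 0.0
    next_skip_map = dict(skip_map)
    for tile_id, tile in tiles.items():
        if skip_map.get(tile_id, False):
            continue  # skip QK, softmax, and PV for this tile entirely
        contribution = compute_tile(tile)
        if abs(contribution) < THRESHOLD:
            next_skip_map[tile_id] = True  # propagate the skip decision
        output += contribution
    return output, next_skip_map

# Across timesteps, the skip map accumulates decisions: a tile judged
# negligible at timestep t is skipped outright at timestep t+1.
skip_map = {}
tiles = {0: 0.9, 1: 1e-5, 2: 0.4}
out, skip_map = attention_with_skips(tiles, skip_map)
```

The key point is the last step: because diffusion timesteps are highly correlated, a tile that was negligible at one timestep is very likely negligible at the next, so the check only has to pay for itself once.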
Measured on production-scale video models:
- Greater than 50% attention sparsity
- Up to 2x forward-pass speedup
- No measurable degradation in VBench metrics
LiteAttention improves inference runtime performance on Hopper GPUs for diffusion transformers. It is not just a specialization; it is a faster alternative for this workload.
What You Can Do With It
LiteAttention is built to be used. It integrates as a drop-in replacement for FlashAttention inside DiT blocks.
In addition to temporal skip propagation, it supports:
INT8 Quantization
Optional per-block INT8 quantization for Q and K, enabled with use_int8=True. Each block of Q and K is quantized with its own scale, which reduces memory usage and improves performance.
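For intuition, here is a generic sketch of per-block symmetric INT8 quantization, the standard technique this feature is based on. The block size, function names, and rounding scheme are our illustrative assumptions, not LiteAttention's kernel code:

```python
# Generic per-block symmetric INT8 quantization sketch.
# Real kernels use tile-sized blocks (e.g. 64 or 128); BLOCK=4 is for display.

BLOCK = 4

def quantize_block(block):
    """Map a block of floats to int8 values with one scale per block."""
    scale = max(abs(x) for x in block) / 127.0 or 1.0  # guard all-zero blocks
    q = [round(x / scale) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 0.1, 0.0]
blocks = [row[i:i + BLOCK] for i in range(0, len(row), BLOCK)]
restored = []
for b in blocks:
    q, s = quantize_block(b)
    restored.extend(dequantize_block(q, s))
# restored approximates row; each block costs one extra float (its scale),
# while Q/K move through memory at a quarter of FP32 width.
```

Because each block carries its own scale, outliers in one block do not destroy precision in the others, which is why per-block schemes hold up well for Q and K.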
Programmable Masking
You can explicitly control which tokens must always be computed or must always be skipped using:
- must_do_list
- must_skip_list
This is independent of threshold-based sparsity.
Calibration
Automatically tune skip thresholds for your model and error budget using the built-in calibration environment. Calibration can use different thresholds per attention head, improving efficiency while preserving output quality.
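The idea behind per-head calibration can be sketched as a search for the largest skip threshold each head tolerates within the error budget. This is our illustrative model of the procedure (the `error_at` stand-in, the binary search, and the toy error curves are assumptions); the actual calibration logic lives in the library:

```python
# Sketch: find, per head, the largest skip threshold whose measured error
# stays within budget. error_at() stands in for running the model sparsely
# and comparing against full attention; assumed monotone in the threshold.

def calibrate_threshold(error_at, budget, lo=0.0, hi=1.0, iters=20):
    """Binary-search the largest threshold with error_at(threshold) <= budget."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if error_at(mid) <= budget:
            lo = mid   # still within budget: try skipping more aggressively
        else:
            hi = mid
    return lo

# Toy monotone error curves for two heads; real calibration measures e.g.
# L1 error between sparse and full attention outputs.
heads = {0: lambda t: t * 0.1, 1: lambda t: t * 0.5}
thresholds = {h: calibrate_threshold(err, budget=0.05) for h, err in heads.items()}
# Head 0 (flatter error curve) ends up with a larger threshold than head 1.
```

This is why per-head thresholds beat a single global one: heads whose outputs are insensitive to skipping get aggressive thresholds, while sensitive heads stay conservative, keeping overall quality within the error budget.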
How to Use LiteAttention
1) Install
pip uninstall -y ninja && pip install ninja
pip install torch packaging einops structlog tomli-w
git clone https://github.com/moonmath-ai/LiteAttention.git
cd LiteAttention/hopper
pip install --no-build-isolation .
Requirements: H100/H200, CUDA >= 12.8, PyTorch >= 2.2, Linux.
2) Replace FlashAttention
Where you currently use FlashAttention:
self.attn = FlashAttention(...)
Replace with:
from lite_attention import LiteAttention
self.attn = LiteAttention(
    enable_skipping=True,
    use_int8=True,
)
Important: instantiate one LiteAttention object per layer so skip states remain independent.
Input format remains (batch, seq_len, heads, head_dim).
3) Optional: Use Mask Control
output = self.attn(
    query,
    key,
    value,
    must_do_list=[2, 12, 40, 45],
    must_skip_list=[80, 100],
)
4) Optional: Calibrate Thresholds
from lite_attention import LiteAttentionRegistry
registry = LiteAttentionRegistry.from_model(
    model,
    mode="calib",
    filename="optimized_thresholds.toml",
    calib_config={"target_error": 0.05, "metric": "L1"},
)
video = model.generate(prompt)
registry.save_if_calib()
Then switch to mode="load" for production runs.
The Bottom Line
FlashAttention3 remains a powerful full attention engine.
LiteAttention is a strict upgrade for diffusion transformers on H100. It improves forward-pass runtime performance by identifying insignificant tiles, converting sparsity into real time savings, propagating skip decisions across timesteps, and supporting INT8 quantization and programmable masking.
When all sparsity mechanisms are disabled, LiteAttention:
- Produces the exact same results as FlashAttention3
- Runs strictly faster in full attention mode
LiteAttention is actively maintained and upgraded by us. If you are already using FlashAttention3 inside diffusion models, you can replace the attention module and measure the gains.
For diffusion inference on H100, there is no reason to use FlashAttention3 over LiteAttention.
Repo:
https://github.com/moonmath-ai/LiteAttention
Paper:
https://arxiv.org/abs/2511.11062