LiteAttention: A Faster Alternative to FlashAttention3 for Diffusion Transformers
TL;DR
If you are running diffusion transformers on H100 or H200 and using FlashAttention3, LiteAttention can reduce forward-pass runtime by up to 2x, without degrading visual quality.
It does this by identifying attention tiles that do not meaningfully contribute to the output, skipping them, and propagating those skip decisions across diffusion timesteps.
It is a drop-in replacement. You can use it today: https://github.com/moonmath-ai/LiteAttention.
FlashAttention3 is the industry standard for high-performance attention on Nvidia Hopper GPUs. We benchmarked it on diffusion transformer workloads with a simple goal: reduce forward-pass runtime.
FlashAttention3 is optimized to compute the full attention matrix efficiently. But when we profiled diffusion workloads carefully, we observed an opportunity.
In diffusion transformers, not all attention tiles contribute equally at every timestep. If insignificant tiles could be identified early and skipped consistently, the forward pass could be meaningfully accelerated.
That observation led to LiteAttention.
Read the full story here: LiteAttention repository.
What Changed
Instead of making each attention tile faster, LiteAttention identifies tiles whose contribution to the output is negligible.
If a tile's impact is small enough, we:
- Skip its QK multiplication
- Skip its softmax
- Skip its PV multiplication
And crucially, we propagate that skip decision across future diffusion timesteps.
This converts sparsity into real wall-clock gains.
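The mechanism can be sketched in plain Python. Everything here is illustrative (the names `THRESHOLD`, `compute_tile`, and the dict-based skip map are our stand-ins, not the kernel's actual data structures); in LiteAttention the decision and the skipping happen inside the fused CUDA kernel.

```python
# Sketch of threshold-based tile skipping with temporal propagation.
# Names and the threshold value are illustrative assumptions.

THRESHOLD = 1e-3

def compute_tile(tile):
    # Stand-in for a tile's attention contribution (QK -> softmax -> PV).
    return tile

def attention_with_skips(tiles, skip_map):
    """Compute attention over tiles, skipping those marked in skip_map.

    Returns the output and an updated skip map that is carried
    forward to the next diffusion timestep.
    """
    output = 0.0
    next_skip_map = dict(skip_map)
    for tile_id, tile in tiles.items():
        if skip_map.get(tile_id, False):
            continue  # skip QK, softmax, and PV for this tile entirely
        contribution = compute_tile(tile)
        if abs(contribution) < THRESHOLD:
            next_skip_map[tile_id] = True  # propagate the skip decision
        output += contribution
    return output, next_skip_map

# Across timesteps, the skip map accumulates decisions: a tile judged
# negligible at timestep t is skipped outright at timestep t+1.
skip_map = {}
tiles = {0: 0.9, 1: 1e-5, 2: 0.4}
out, skip_map = attention_with_skips(tiles, skip_map)
```

The key point is the last step: because diffusion timesteps are highly correlated, a tile that was negligible at one timestep is very likely negligible at the next, so the check only has to pay for itself once.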
Measured on production-scale video models:
- Greater than 50% attention sparsity
- Up to 2x forward-pass speedup
- No measurable degradation in VBench metrics
LiteAttention improves inference runtime performance on Hopper GPUs for diffusion transformers. It is not just a specialization; it is a faster alternative for this workload.
What You Can Do With It
LiteAttention is built to be used. It integrates as a drop-in replacement for FlashAttention inside DiT blocks.
In addition to temporal skip propagation, it supports:
INT8 Quantization
Optional per-block INT8 quantization for Q and K, enabled with use_int8=True. Each block of Q and K is quantized with its own scale, which reduces memory usage and improves performance.
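For intuition, here is a generic sketch of per-block symmetric INT8 quantization, the standard technique this feature is based on. The block size, function names, and rounding scheme are our illustrative assumptions, not LiteAttention's kernel code:

```python
# Generic per-block symmetric INT8 quantization sketch.
# Real kernels use tile-sized blocks (e.g. 64 or 128); BLOCK=4 is for display.

BLOCK = 4

def quantize_block(block):
    """Map a block of floats to int8 values with one scale per block."""
    scale = max(abs(x) for x in block) / 127.0 or 1.0  # guard all-zero blocks
    q = [round(x / scale) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 0.1, 0.0]
blocks = [row[i:i + BLOCK] for i in range(0, len(row), BLOCK)]
restored = []
for b in blocks:
    q, s = quantize_block(b)
    restored.extend(dequantize_block(q, s))
# restored approximates row; each block costs one extra float (its scale),
# while Q/K move through memory at a quarter of FP32 width.
```

Because each block carries its own scale, outliers in one block do not destroy precision in the others, which is why per-block schemes hold up well for Q and K.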
Programmable Masking
You can explicitly control which tokens must always be computed or must always be skipped using:
- must_do_list
- must_skip_list
This is independent of threshold-based sparsity.
Calibration
Automatically tune skip thresholds for your model and error budget using the built-in calibration environment. Calibration can use different thresholds per attention head, improving efficiency while preserving output quality.
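The idea behind per-head calibration can be sketched as a search for the largest skip threshold each head tolerates within the error budget. This is our illustrative model of the procedure (the `error_at` stand-in, the binary search, and the toy error curves are assumptions); the actual calibration logic lives in the library:

```python
# Sketch: find, per head, the largest skip threshold whose measured error
# stays within budget. error_at() stands in for running the model sparsely
# and comparing against full attention; assumed monotone in the threshold.

def calibrate_threshold(error_at, budget, lo=0.0, hi=1.0, iters=20):
    """Binary-search the largest threshold with error_at(threshold) <= budget."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if error_at(mid) <= budget:
            lo = mid   # still within budget: try skipping more aggressively
        else:
            hi = mid
    return lo

# Toy monotone error curves for two heads; real calibration measures e.g.
# L1 error between sparse and full attention outputs.
heads = {0: lambda t: t * 0.1, 1: lambda t: t * 0.5}
thresholds = {h: calibrate_threshold(err, budget=0.05) for h, err in heads.items()}
# Head 0 (flatter error curve) ends up with a larger threshold than head 1.
```

This is why per-head thresholds beat a single global one: heads whose outputs are insensitive to skipping get aggressive thresholds, while sensitive heads stay conservative, keeping overall quality within the error budget.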
How to Use LiteAttention
1) Install
pip uninstall -y ninja && pip install ninja
pip install torch packaging einops structlog tomli-w
git clone https://github.com/moonmath-ai/LiteAttention.git
cd LiteAttention/hopper
pip install --no-build-isolation .
Requirements: H100/H200, CUDA >= 12.8, PyTorch >= 2.2, Linux.
2) Replace FlashAttention
Where you currently use FlashAttention:
self.attn = FlashAttention(...)
Replace with:
from lite_attention import LiteAttention
self.attn = LiteAttention(
    enable_skipping=True,
    use_int8=True,
)
Important: instantiate one LiteAttention object per layer so skip states remain independent.
Input format remains (batch, seq_len, heads, head_dim).
3) Optional: Use Mask Control
output = self.attn(
    query,
    key,
    value,
    must_do_list=[2, 12, 40, 45],
    must_skip_list=[80, 100],
)
4) Optional: Calibrate Thresholds
from lite_attention import LiteAttentionRegistry
registry = LiteAttentionRegistry.from_model(
    model,
    mode="calib",
    filename="optimized_thresholds.toml",
    calib_config={"target_error": 0.05, "metric": "L1"},
)
video = model.generate(prompt)
registry.save_if_calib()
Then switch to mode="load" for production runs.
The Bottom Line
FlashAttention3 remains a powerful full attention engine.
LiteAttention is a strict upgrade for diffusion transformers on H100. It improves forward-pass runtime performance by identifying insignificant tiles, converting sparsity into real time savings, propagating skip decisions across timesteps, and supporting INT8 quantization and programmable masking.
When all sparsity mechanisms are disabled, LiteAttention:
- Produces the exact same results as FlashAttention3
- Runs strictly faster in full attention mode
LiteAttention is actively maintained and upgraded by us. If you are already using FlashAttention3 inside diffusion models, you can replace the attention module and measure the gains.
For diffusion inference on H100, there is no reason to use FlashAttention3 over LiteAttention.
Repo:
https://github.com/moonmath-ai/LiteAttention
Paper:
https://arxiv.org/abs/2511.11062