Partnership - Bria Fibo Acceleration
TL;DR
- Try MoonMath’s Fibo optimization: https://fal.ai/models/bria/fibo-lite/generate/structured_prompt
- Joint optimization of Bria FIBO’s VLM and diffusion stages across NVIDIA Hopper, NVIDIA DGX Spark, and AMD Instinct.
- VLM latency dropped from roughly 7 seconds to about 1.5–2 seconds (~3–4×); diffusion improved from ~0.9s to ~0.7s while preserving quality.
- On DGX Spark, end-to-end work brought baseline runtime from ~70s to under 10 seconds per image generation/edit, with further improvements in progress.
In this post, we describe a joint optimization effort between MoonMath and Bria to accelerate the Fibo family of image models. The work spans both Vision-Language Model (VLM) and diffusion components, and targets deployments on NVIDIA Hopper, NVIDIA Blackwell (DGX Spark), and AMD Instinct.
Background
Bria is a generative AI company focused on building production-grade visual models for enterprises, with an emphasis on commercially safe and compliant content generation. Its models are trained on fully licensed data, enabling businesses to generate and edit images without legal ambiguity.
Fibo is the first open-source model trained exclusively on long, structured captions. It introduces a new paradigm for image generation: a two-stage approach consisting of a reasoner (VLM) that expands user intent, and a renderer (diffusion) that produces the final image.[1] (https://arxiv.org/pdf/2511.06876v1)
Through Fibo and downstream systems, Bria delivers high-quality image generation capabilities optimized for real-world deployment, including controllability, consistency, and integration into production workflows.
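To make the two-stage flow concrete, here is a minimal sketch of the reasoner-renderer split. The function names and the structured-prompt fields are illustrative placeholders, not Bria's actual API; both stages are stubbed so the wiring is visible.

```python
import json

# Hypothetical sketch of Fibo's two-stage flow (names are illustrative,
# not Bria's real interfaces).

def reason(user_prompt: str) -> dict:
    """Stage 1, the VLM 'reasoner': expand a short user prompt into a
    long, structured caption. Stubbed with a fixed expansion here."""
    return {
        "subject": user_prompt,
        "style": "photorealistic",
        "lighting": "soft studio light",
    }

def render(structured_prompt: dict, steps: int = 28) -> str:
    """Stage 2, the diffusion 'renderer': produce the final image from
    the structured prompt. Stubbed to return a placeholder identifier."""
    return f"image({json.dumps(structured_prompt, sort_keys=True)}, steps={steps})"

def generate(user_prompt: str) -> str:
    # End-to-end latency is the sum of both stages, which is why the
    # optimization work below targets each stage separately.
    return render(reason(user_prompt))

print(generate("a red bicycle"))
```

The split matters for optimization: the reasoner is an autoregressive VLM (serving-style latency), while the renderer is an iterative diffusion loop (kernel- and overhead-bound), so each benefits from different techniques.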
At the start of this collaboration:
- The VLM stage was already optimized and running on vLLM.
- The diffusion stage was already optimized using techniques such as quantization (via Pruna).
Stage 1: VLM acceleration
The VLM stage is responsible for generating the structured JSON output. Since vLLM already ensures efficient kernel execution, our work concentrated on designing a scheduling and parallelization strategy that better utilizes the underlying hardware. This approach is broadly applicable beyond the Bria use case, and we plan to cover it in a dedicated blog post. If you are working on VLM acceleration, feel free to reach out.
This effort produced a significant latency reduction: the VLM stage improved from approximately 7 seconds to 1.5–2 seconds, a 3–4× speedup. As shown below, we measured 1.7 seconds on AMD MI300X and 1.4 seconds on NVIDIA H100. (The Hopper–Instinct comparison is intentional: we typically offer partners a choice between platforms; in this case, Bria selected NVIDIA, so subsequent optimization focused on that stack.) Because no kernels were modified, output quality is identical to the baseline.
Stage 2: Diffusion Optimization
The second stage converts structured prompts into images via diffusion. This stage was already highly optimized (e.g., quantization), but additional gains were still possible.
We introduced runtime-level optimizations, including:
- Custom attention masking (via LiteAttention)
- CUDA Graphs integration to reduce execution overhead
- Fine-tuned execution flow for improved efficiency

As a result, latency dropped from ~0.9 seconds to ~0.7 seconds, further tightening end-to-end latency with no degradation in output quality.
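The CUDA Graphs part of this can be sketched with PyTorch's standard capture/replay pattern: warm up on a side stream, record one forward pass into a graph, then replay the recorded kernels each denoising step without per-kernel launch overhead. The model and shapes below are placeholders, not Bria's actual pipeline.

```python
import torch

def capture_graph(model, static_input):
    """Capture one forward pass of `model` into a CUDA graph.
    Replaying the graph re-executes the recorded kernels directly,
    skipping Python dispatch and per-kernel launch overhead."""
    with torch.no_grad():
        # Warm up on a side stream so allocations/kernels are settled
        # before capture begins.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(3):
                model(static_input)
        torch.cuda.current_stream().wait_stream(s)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = model(static_input)
    return graph, static_output

if torch.cuda.is_available():
    net = torch.nn.Linear(64, 64).cuda().eval()
    x = torch.randn(8, 64, device="cuda")
    g, out = capture_graph(net, x)
    # New inputs are copied into the captured buffer in place;
    # replay() then refreshes `out` with the new result.
    x.copy_(torch.randn(8, 64, device="cuda"))
    g.replay()
```

The constraint that makes this worthwhile for diffusion is fixed shapes: the denoising loop runs the same graph dozens of times per image, so amortizing launch overhead once per step adds up.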
We presented Bria with multiple optimization configurations, and they selected the best tradeoff based on their domain expertise.
Stage 3: DGX Spark (Blackwell) Acceleration
NVIDIA DGX Spark is a compact, desktop-class Blackwell-based AI system that tightly couples Grace (Arm) CPUs and GPUs via high-bandwidth unified memory, enabling efficient data sharing and high-throughput inference in a significantly reduced form factor.
Our goal was to reduce latency to under 10 seconds per image generation or edit, extracting maximum performance from the Spark architecture to enable Bria’s efficient local model execution for future use cases.
After porting the model to Spark, the initial baseline stood at approximately 70 seconds. We focused on aligning the model configuration, kernels, and runtime environment with the hardware. To reach the 10-second target, here is what we have done so far:
- Building libraries from source with SM121 support
- Quantizing models to NVFP4
- Testing multiple attention implementations to determine the best fit for the hardware
- Applying custom vLLM patches to ensure compatibility
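To illustrate the NVFP4 step, here is a simplified fake-quantization sketch: FP4 (E2M1) element values with a per-16-element block scale. This is an assumption-laden toy for intuition only; the production format additionally stores block scales in FP8 (E4M3) plus a per-tensor FP32 scale, and real quantization runs through NVIDIA's tooling, not NumPy.

```python
import numpy as np

# Representable FP4 E2M1 magnitudes (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize-dequantize `x` with per-block scaling: map each block's
    max magnitude to 6.0 (the largest E2M1 value), snap every element
    to the nearest representable magnitude, then rescale back."""
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0          # avoid dividing all-zero blocks
    scaled = blocks / scale
    # Nearest-value rounding into the E2M1 grid.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1)
    return (np.sign(scaled) * E2M1[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=64)
wq = fake_quant_nvfp4(w)
print("mean abs error:", np.abs(w - wq).mean())
```

The small block size is what makes 4-bit weights viable: each group of 16 values gets its own scale, so a single outlier only degrades its local block rather than the whole tensor.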
We are not done: we are actively working on getting closer to real time. Ping us if you want to learn more about running high-end diffusion models on DGX Spark.
Parting Thoughts
This collaboration highlights a broader theme: the next frontier in generative media performance is not just better models; it is better systems.
By combining mathematical insight, systems engineering, and hardware awareness, it is possible to push production systems significantly beyond “out-of-the-box” performance without touching model weights or sacrificing quality.
These same optimization techniques can be applied to similar challenges in VLMs, diffusion, and world models, making them relevant for teams looking to push performance on real production workloads.
Try MoonMath’s Fibo optimization: https://fal.ai/models/bria/fibo-lite/generate/structured_prompt