Cosmos-Predict2.5-2B Inference
NVIDIA H200 vs AMD MI300X
Single-GPU inference at 720p · Cosmos-Predict2.5-2B post-trained ·
bf16 precision · 36 diffusion steps
TL;DR
- Side-by-side outputs from NVIDIA H200 (Hopper, FlashAttention3) and AMD MI300X for Cosmos Predict 2.5 at equal quality.
- ~1.27–1.49× faster wall-clock time on MI300X across the scenarios below (encode + diffusion + decode).
- We are continuing to push performance further: a second version under development targets up to an additional 50% reduction in runtime.
- We would like to thank HotAisle, our AMD cloud provider, for bare-metal access and general support.
Abstract
We ported the Cosmos family of models to AMD GPUs and present results for Cosmos Predict 2.5, a world foundation model for large-scale generative simulation in physical AI, running end-to-end on AMD MI300X. In our benchmarks, the AMD implementation achieved ~1.4× speedup over the NVIDIA baseline (Hopper, FlashAttention3) at equal quality.
To our knowledge, this is among the first production-grade deployments of a world model on AMD GPUs (alongside Micro-World), enabling serious AI simulation workloads outside the NVIDIA ecosystem. It also serves as a concrete proof point that, for large diffusion-based models, AMD hardware is already competitive with, and in some cases superior to, NVIDIA's.
We are continuing to push performance further: a second version, currently under development, targets up to an additional 50% reduction in runtime.
Video demos[1][2]
[1] These examples are from the Cosmos assets: https://github.com/nvidia-cosmos/cosmos-predict2.5/tree/main/assets/base
[2] AMD peak memory is 66.97 GB; NVIDIA peak memory is 55.34 GB.
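For readers reproducing the comparison, the wall-clock methodology above (encode + 36 diffusion steps + decode, end to end) can be sketched with a minimal timing harness. The stage names and callables below are placeholders, not the actual Cosmos Predict 2.5 API; only the measurement structure is illustrated:

```python
import time

def run_pipeline(stages, num_diffusion_steps=36):
    """Time each stage of a generate call and return wall-clock seconds.

    `stages` maps stage name -> zero-arg callable; the "diffusion" stage is
    invoked once per denoising step (36 in the benchmarks above). All stage
    names and callables are hypothetical stand-ins for the real pipeline.
    """
    timings = {}
    total_start = time.perf_counter()
    for name, fn in stages.items():
        start = time.perf_counter()
        if name == "diffusion":
            for _ in range(num_diffusion_steps):
                fn()
        else:
            fn()
        timings[name] = time.perf_counter() - start
    timings["total"] = time.perf_counter() - total_start
    return timings

def speedup(baseline_total_s, candidate_total_s):
    """Wall-clock speedup of candidate over baseline, e.g. the ~1.27-1.49x
    MI300X-vs-H200 figures are baseline_total / candidate_total."""
    return baseline_total_s / candidate_total_s
```

On CUDA or ROCm hardware each stage callable would wrap the real encoder, denoiser step, and decoder, with a device synchronization before reading the clock so that queued kernels are included in the measured interval.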