Cosmos-Predict2.5-2B Inference
NVIDIA H200 vs AMD MI300X

Single-GPU inference at 720p · Cosmos-Predict2.5-2B post-trained ·
bf16 precision · 36 diffusion steps

TL;DR

  • Side-by-side output on NVIDIA H200 (Hopper, FlashAttention3) vs AMD MI300X for Cosmos Predict 2.5 at equal quality.
  • ~1.27–1.49× faster wall-clock on MI300X across the scenarios below (encode + diffusion + decode).
  • We are continuing to push performance further: a second version under development targets up to an additional 50% reduction in runtime.
  • We would like to thank HotAisle, our AMD cloud provider, for bare-metal access and general support.
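As a quick illustration of how the headline figures are put together, here is a minimal sketch that sums the three stages and takes the ratio of totals, using the Bus Terminal timings reported further down. The helper names `total_seconds` and `speedup` are ours, chosen only for this example; they are not part of the Cosmos codebase.

```python
# Per-stage wall-clock times in seconds for the "Bus Terminal" scenario,
# copied from the results in this post.
h200 = {"encode": 7.436, "diffusion": 242.0, "decode": 5.402}
mi300x = {"encode": 3.954, "diffusion": 166.0, "decode": 7.287}


def total_seconds(stages):
    """End-to-end wall-clock time: encode + diffusion + decode."""
    return sum(stages.values())


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


h200_total = total_seconds(h200)
mi300x_total = total_seconds(mi300x)
bus_terminal_speedup = speedup(h200_total, mi300x_total)

print(f"H200 {h200_total:.2f}s, MI300X {mi300x_total:.2f}s, "
      f"speedup {bus_terminal_speedup:.2f}x")
# prints: H200 254.84s, MI300X 177.24s, speedup 1.44x
```

The speedup is a ratio of end-to-end totals, so it is dominated by the diffusion stage, which accounts for most of the runtime on both GPUs.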

Abstract

We ported the Cosmos family of models to AMD GPUs and present results for Cosmos Predict 2.5, a world foundation model for large-scale generative simulation in physical AI, running end-to-end on AMD MI300X. In our benchmarks, the AMD implementation achieved a ~1.4× end-to-end speedup on average (1.27–1.49× depending on the scenario) over the NVIDIA H200 baseline (Hopper, FlashAttention3) at equal quality.

To our knowledge, this is among the first production-grade deployments of a world model on AMD GPUs (alongside Micro-World), enabling serious AI simulation workloads outside the NVIDIA ecosystem. It also serves as a concrete proof point that, for large diffusion-based models, AMD hardware is already competitive with, and in some cases superior to, NVIDIA hardware.

We are continuing to push performance further: a second version, currently under development, targets up to an additional 50% reduction in runtime.
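Because this post quotes both speedup factors and a percentage runtime reduction, it may help to see how the two relate. The sketch below is plain arithmetic, not a measurement; the function names are hypothetical and exist only for this example.

```python
def reduction_to_speedup(reduction):
    """Convert a fractional runtime reduction (0.5 means 50%) to a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_to_reduction(factor):
    """Convert a speedup factor (1.4 means 1.4x faster) to a fractional runtime reduction."""
    if factor <= 0:
        raise ValueError("speedup factor must be positive")
    return 1.0 - 1.0 / factor


# A 50% runtime reduction is the same thing as a 2x speedup.
print(reduction_to_speedup(0.5))            # 2.0

# A ~1.4x speedup corresponds to roughly a 29% shorter runtime.
print(round(speedup_to_reduction(1.4), 3))  # 0.286

# Illustration only: what the stated target would imply IF it were fully met
# on top of a ~1.4x speedup. This is arithmetic, not a benchmark result.
print(round(1.4 * reduction_to_speedup(0.5), 2))  # 2.8
```

Speedup factors multiply, while percentage reductions do not add, which is why the conversion is worth making explicit.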

Video demos[1][2]

[1] These examples are from the Cosmos assets: https://github.com/nvidia-cosmos/cosmos-predict2.5/tree/main/assets/base

[2] Peak memory: 66.97 GB on AMD, 55.34 GB on NVIDIA.

Bus Terminal

Image + Text · Speedup: 1.44×
NVIDIA H200
Encode 7.436s · Diffusion 242.0s · Decode 5.402s · Total 254.840s
AMD MI300X
Encode 3.954s · Diffusion 166.0s · Decode 7.287s · Total 177.240s

Bus Terminal (Long)

Image + Text · Speedup: 1.39×
NVIDIA H200
Encode 12.237s · Diffusion 726.0s · Decode 14.071s · Total 752.310s
AMD MI300X
Encode 11.825s · Diffusion 506.0s · Decode 21.808s · Total 539.630s

Robot Pouring

Video + Text · Speedup: 1.38×
NVIDIA H200
Encode 7.421s · Diffusion 242.0s · Decode 5.365s · Total 254.790s
AMD MI300X
Encode 4.013s · Diffusion 174.0s · Decode 7.268s · Total 185.280s

Robot Tightening

Text · Speedup: 1.27×
NVIDIA H200
Encode 7.502s · Diffusion 242.0s · Decode 5.388s · Total 254.890s
AMD MI300X
Encode 3.946s · Diffusion 190.0s · Decode 7.282s · Total 201.230s

Robot Welding

Image + Text · Speedup: 1.41×
NVIDIA H200
Encode 7.330s · Diffusion 242.0s · Decode 5.365s · Total 254.700s
AMD MI300X
Encode 3.942s · Diffusion 170.0s · Decode 7.256s · Total 181.200s

Sand Mining

Video + Text · Speedup: 1.49×
NVIDIA H200
Encode 7.360s · Diffusion 242.0s · Decode 5.413s · Total 254.770s
AMD MI300X
Encode 4.022s · Diffusion 160.0s · Decode 7.275s · Total 171.300s

Snowy Stop Light

Text · Speedup: 1.30×
NVIDIA H200
Encode 7.449s · Diffusion 242.0s · Decode 5.381s · Total 254.830s
AMD MI300X
Encode 3.947s · Diffusion 185.0s · Decode 7.275s · Total 196.220s
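
For readers who want to check the numbers or slice them differently, the following sketch recomputes the end-to-end and per-stage speedups from the timings listed above. The data is copied from this post; the structure and the names (`RESULTS`, `stage_speedups`, `total_speedup`) are ours and purely illustrative.

```python
from statistics import mean

STAGES = ("encode", "diffusion", "decode")

# (scenario, H200 times, MI300X times); each tuple is (encode, diffusion, decode) in seconds.
RESULTS = [
    ("Bus Terminal",        (7.436, 242.0, 5.402),    (3.954, 166.0, 7.287)),
    ("Bus Terminal (Long)", (12.237, 726.0, 14.071),  (11.825, 506.0, 21.808)),
    ("Robot Pouring",       (7.421, 242.0, 5.365),    (4.013, 174.0, 7.268)),
    ("Robot Tightening",    (7.502, 242.0, 5.388),    (3.946, 190.0, 7.282)),
    ("Robot Welding",       (7.330, 242.0, 5.365),    (3.942, 170.0, 7.256)),
    ("Sand Mining",         (7.360, 242.0, 5.413),    (4.022, 160.0, 7.275)),
    ("Snowy Stop Light",    (7.449, 242.0, 5.381),    (3.947, 185.0, 7.275)),
]


def stage_speedups(h200, mi300x):
    """Per-stage speedup of MI300X over H200 (values below 1 mean MI300X is slower)."""
    return {name: h / m for name, h, m in zip(STAGES, h200, mi300x)}


def total_speedup(h200, mi300x):
    """End-to-end speedup: ratio of summed stage times."""
    return sum(h200) / sum(mi300x)


totals = {name: total_speedup(h, m) for name, h, m in RESULTS}
per_stage = {name: stage_speedups(h, m) for name, h, m in RESULTS}
mean_total = mean(totals.values())

for name, value in totals.items():
    stages = ", ".join(f"{k} {v:.2f}x" for k, v in per_stage[name].items())
    print(f"{name:<20} total {value:.2f}x  ({stages})")
print(f"Mean end-to-end speedup: {mean_total:.2f}x")
```

Broken down this way, the gain comes from the encode and diffusion stages; decode is slower on MI300X than on H200 in every scenario listed, but it is a small share of the total runtime.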

Learn more

  1. Baseline Commit: https://github.com/nvidia-cosmos/cosmos-predict2.5/commit/315e424d59ad132e6f6f9e63c24f12a51e0dfb73