Cosmos-Predict2.5-2B Inference
NVIDIA H200 vs AMD MI300X

Apr 7, 2026
Research

Single-GPU inference at 720p · Cosmos-Predict2.5-2B post-trained ·
bf16 precision · 36 diffusion steps

TL;DR

Side-by-side output on NVIDIA H200 (Hopper, FlashAttention3) vs AMD MI300X for Cosmos Predict 2.5 at equal quality.
~1.27–1.49× faster wall-clock on MI300X across the scenarios below (encode + diffusion + decode).
We are continuing to push performance further: a second version under development targets up to an additional 50% reduction in runtime.
We would like to thank HotAisle, our AMD cloud provider, for bare-metal access and general support.

Abstract

We ported the Cosmos family of models to AMD GPUs and present results for Cosmos Predict 2.5, a world foundation model for large-scale generative simulation in physical AI, running end-to-end on AMD MI300X. In our benchmarks, the AMD implementation achieved a ~1.4× speedup over the NVIDIA baseline (H200, Hopper, FlashAttention3) at equal quality.

To our knowledge, this is among the first production-grade deployments of a world model on AMD GPUs (alongside Micro-World), enabling serious AI simulation workloads outside the NVIDIA ecosystem. It also serves as a concrete proof point that, for large diffusion-based models, AMD hardware is already competitive with, and in some cases superior to, NVIDIA hardware.

We are continuing to push performance further: a second version, currently under development, targets up to an additional 50% reduction in runtime.

Video demos [1][2]

[1] These examples are from the Cosmos assets: https://github.com/nvidia-cosmos/cosmos-predict2.5/tree/main/assets/base

[2] Peak memory: 66.97 GB on AMD, 55.34 GB on NVIDIA.

Bus Terminal

Image + Text · Speedup: 1.44×

NVIDIA H200

Encode 7.436s Diffusion 242.0s Decode 5.402s Total 254.840s

AMD MI300X

Encode 3.954s Diffusion 166.0s Decode 7.287s Total 177.240s
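
To see where the end-to-end gain comes from, the Bus Terminal timings above can be split by stage. The following is a minimal sketch, not part of the original benchmark harness; the variable names are illustrative, and the numbers are copied from the two timing lines above.

```python
# Stage timings in seconds for the Bus Terminal scenario, copied from above.
STAGES = ("encode", "diffusion", "decode")
h200 = dict(zip(STAGES, (7.436, 242.0, 5.402)))
mi300x = dict(zip(STAGES, (3.954, 166.0, 7.287)))

# Per-stage speedup: H200 time divided by MI300X time.
# A value above 1 means the MI300X stage is faster; below 1 means slower.
per_stage = {s: h200[s] / mi300x[s] for s in STAGES}

# Fraction of the end-to-end time spent in diffusion on each GPU.
diffusion_share = {
    "H200": h200["diffusion"] / sum(h200.values()),
    "MI300X": mi300x["diffusion"] / sum(mi300x.values()),
}

for stage in STAGES:
    print(f"{stage:10s} speedup {per_stage[stage]:.2f}x")
for gpu, share in diffusion_share.items():
    print(f"{gpu:7s} diffusion share {share:.1%}")
```

In this scenario diffusion accounts for well over 90% of the wall-clock time on both GPUs, so the end-to-end speedup tracks the diffusion speedup closely. Encode is faster on the MI300X, while decode is slower than on the H200.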

Bus Terminal (Long)

Image + Text · Speedup: 1.39×

NVIDIA H200

Encode 12.237s Diffusion 726.0s Decode 14.071s Total 752.310s

AMD MI300X

Encode 11.825s Diffusion 506.0s Decode 21.808s Total 539.630s

Robot Pouring

Video + Text · Speedup: 1.38×

NVIDIA H200

Encode 7.421s Diffusion 242.0s Decode 5.365s Total 254.790s

AMD MI300X

Encode 4.013s Diffusion 174.0s Decode 7.268s Total 185.280s

Robot Tightening

Text · Speedup: 1.27×

NVIDIA H200

Encode 7.502s Diffusion 242.0s Decode 5.388s Total 254.890s

AMD MI300X

Encode 3.946s Diffusion 190.0s Decode 7.282s Total 201.230s

Robot Welding

Image + Text · Speedup: 1.41×

NVIDIA H200

Encode 7.330s Diffusion 242.0s Decode 5.365s Total 254.700s

AMD MI300X

Encode 3.942s Diffusion 170.0s Decode 7.256s Total 181.200s

Sand Mining

Video + Text · Speedup: 1.49×

NVIDIA H200

Encode 7.360s Diffusion 242.0s Decode 5.413s Total 254.770s

AMD MI300X

Encode 4.022s Diffusion 160.0s Decode 7.275s Total 171.300s

Snowy Stop Light

Text · Speedup: 1.30×

NVIDIA H200

Encode 7.449s Diffusion 242.0s Decode 5.381s Total 254.830s

AMD MI300X

Encode 3.947s Diffusion 185.0s Decode 7.275s Total 196.220s
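
The per-scenario speedups can be recomputed from the stage timings listed above as the ratio of end-to-end totals (encode + diffusion + decode). The sketch below does that for all seven scenarios. The `TIMINGS` table and the function name are illustrative, and the unweighted mean at the end is an assumption about how a single headline figure could be summarized, not a statement of how the ~1.4× figure was derived.

```python
# (encode, diffusion, decode) in seconds, copied from the results above.
TIMINGS = {
    "Bus Terminal":        {"H200": (7.436, 242.0, 5.402),   "MI300X": (3.954, 166.0, 7.287)},
    "Bus Terminal (Long)": {"H200": (12.237, 726.0, 14.071), "MI300X": (11.825, 506.0, 21.808)},
    "Robot Pouring":       {"H200": (7.421, 242.0, 5.365),   "MI300X": (4.013, 174.0, 7.268)},
    "Robot Tightening":    {"H200": (7.502, 242.0, 5.388),   "MI300X": (3.946, 190.0, 7.282)},
    "Robot Welding":       {"H200": (7.330, 242.0, 5.365),   "MI300X": (3.942, 170.0, 7.256)},
    "Sand Mining":         {"H200": (7.360, 242.0, 5.413),   "MI300X": (4.022, 160.0, 7.275)},
    "Snowy Stop Light":    {"H200": (7.449, 242.0, 5.381),   "MI300X": (3.947, 185.0, 7.275)},
}


def end_to_end_speedup(scenario: str) -> float:
    """Return H200 total time divided by MI300X total time for one scenario."""
    stages = TIMINGS[scenario]
    return sum(stages["H200"]) / sum(stages["MI300X"])


speedups = {name: end_to_end_speedup(name) for name in TIMINGS}

# Assumption: an unweighted arithmetic mean over scenarios, used here only
# as one plausible way to summarize the table in a single number.
mean_speedup = sum(speedups.values()) / len(speedups)

for name, value in speedups.items():
    print(f"{name:20s} {value:.2f}x")
print(f"{'mean':20s} {mean_speedup:.2f}x")
```

Rounded to two decimals, these ratios match the speedup figures quoted for each scenario, and the lowest and highest values correspond to the 1.27–1.49× range given in the TL;DR.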

Learn more

Baseline Commit: https://github.com/nvidia-cosmos/cosmos-predict2.5/commit/315e424d59ad132e6f6f9e63c24f12a51e0dfb73

