GPU Hardware Features: NVIDIA and AMD

An overview of modern GPU architectures and their distinctive hardware features

Posted by Vipul Sharma on October 22, 2025 · 2 mins read

NVIDIA GPU Architectures

Here's a breakdown of NVIDIA's recent GPU architectures and their compute capabilities:

  • sm_89: Ada (RTX 4090, 4080, 4070) → Compute Capability 8.9
  • sm_90: Hopper (H100, H200) → Compute Capability 9.0
  • sm_100: Blackwell Datacenter (B200) → Compute Capability 10.0
  • sm_120: Blackwell Consumer (RTX 5090, 5080, 5070) → Compute Capability 12.0
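The table above can be modeled as a small lookup, useful when gating kernels on compute capability. This is just an illustrative sketch in pure Python; the helper names are hypothetical, not any library's API:

```python
# Illustrative mapping from NVIDIA architecture names (as listed above)
# to (marketing name, compute capability). Helper names are hypothetical.
ARCH_TO_CC = {
    "sm_89":  ("Ada",                  (8, 9)),
    "sm_90":  ("Hopper",               (9, 0)),
    "sm_100": ("Blackwell Datacenter", (10, 0)),
    "sm_120": ("Blackwell Consumer",   (12, 0)),
}

def compute_capability(arch: str) -> str:
    """Return the compute capability string, e.g. 'sm_90' -> '9.0'."""
    _, (major, minor) = ARCH_TO_CC[arch]
    return f"{major}.{minor}"

print(compute_capability("sm_89"))   # 8.9
print(compute_capability("sm_120"))  # 12.0
```

On a real machine the same (major, minor) pair is reported by `torch.cuda.get_device_capability()` or CUDA's `cudaGetDeviceProperties`.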

AMD GPU Architectures

To be completed - currently gathering information

One interesting observation: AMD uses the term "Matrix Core" for what NVIDIA calls a "Tensor Core".

Key Hardware Features

Warp Management

  • Warps
  • Warp groups
  • Warp specialization
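A rough mental model for warp specialization: some warps are dedicated to moving data (producers) while others do the math (consumers), communicating through shared-memory stages. The pure-Python sketch below mimics that split with threads and a bounded queue; it is a conceptual model only, not a GPU API, and all names are hypothetical:

```python
# Conceptual model of warp specialization: producer "warps" fill a bounded
# pipeline (standing in for shared-memory stages) while consumer "warps"
# drain it and compute. Pure Python; not a GPU API.
from queue import Queue
from threading import Thread

def producer(tiles, pipe: Queue):
    # Producer role: issue loads (here, just enqueue tiles), then signal done.
    for tile in tiles:
        pipe.put(tile)
    pipe.put(None)

def consumer(pipe: Queue, results: list):
    # Consumer role: wait on filled stages and run the math (here, a sum).
    while (tile := pipe.get()) is not None:
        results.append(sum(tile))

tiles = [[1, 2], [3, 4], [5, 6]]
pipe, results = Queue(maxsize=2), []  # maxsize ~ number of pipeline stages
t_prod = Thread(target=producer, args=(tiles, pipe))
t_cons = Thread(target=consumer, args=(pipe, results))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(results)  # [3, 7, 11]
```

The bounded queue plays the role that mbarrier-guarded shared-memory buffers play on the GPU: producers stall when all stages are full, consumers stall when all are empty.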

Memory Features

  • TMA (Tensor Memory Accelerator)
  • Distributed shared memory
  • Tensor memory (TMEM)
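A common pattern these memory features enable is double buffering: while the compute stage works on one tile, the next tile is copied in behind it (on Hopper/Blackwell, by TMA). The sketch below is a sequential pure-Python model of that pipeline, with the "async copy" reduced to an ordinary assignment; the helper is hypothetical:

```python
# Sequential model of a double-buffered pipeline of the kind TMA enables:
# prefetch tile i+1 into the idle buffer while "computing" on tile i.
def pipeline(tiles):
    if not tiles:
        return []
    out = []
    buffers = [None, None]
    buffers[0] = tiles[0]                 # prologue: prefetch first tile
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[nxt] = tiles[i + 1]   # models the async bulk copy
        out.append(sum(buffers[cur]))     # compute on the current tile
    return out

print(pipeline([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```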

Precision Support

Hardware-supported vs. simulated precision formats:

  • MXFP8 - native tensor-core support on Blackwell; simulated on earlier architectures
  • FP8 - hardware-supported on recent architectures (Ada/Hopper and later)
  • MXFP4 - native tensor-core support on Blackwell; simulated on earlier architectures
  • NVFP4 - NVIDIA's 4-bit format, with native tensor-core support on Blackwell
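The MX formats (MXFP8, MXFP4) store one shared power-of-two scale per 32-element block alongside the narrow elements. The sketch below models only that shared-scale step and skips the FP8/FP4 element rounding a real implementation would perform; the scale-selection policy and helper names are illustrative assumptions:

```python
# Illustrative model of MX block scaling: one power-of-two scale per
# 32-element block. Element rounding to the FP8/FP4 grid is omitted,
# so the round trip here is exact. Helper names are hypothetical.
import math

BLOCK = 32  # MX formats share one scale across 32 elements

def mx_quantize(block, elem_max=448.0):
    """Pick a power-of-two scale so block/scale fits in [-elem_max, elem_max]
    (448 is the max normal value of FP8 E4M3)."""
    amax = max(abs(x) for x in block) or 1.0
    scale = 2.0 ** math.ceil(math.log2(amax / elem_max))
    stored = [x / scale for x in block]  # real HW would round these to FP8/FP4
    return scale, stored

def mx_dequantize(scale, stored):
    return [q * scale for q in stored]
```

Simulating these formats on pre-Blackwell hardware amounts to doing this scale/round step in software before feeding ordinary tensor-core matmuls.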

Software Tools and Frameworks

Warp Specialization Support

  • Triton and Gluon - support warp specialization
  • PyTorch and TLX - support warp specialization
  • JAX Pallas - supports Blackwell matmuls

DSL Considerations

  • CuTe DSL - Currently no AMD support
  • TMA on AMD - Unclear whether AMD has a TMA equivalent; a possible TDA implementation appears in Triton PR #8333

References

  1. NVIDIA PTX - Warp-Level Matrix Instructions
  2. Hacker News Discussion
  3. CUDA Programming Guide - Compute Capabilities
  4. CUTLASS Blackwell Functionality
  5. Jarmusch, A., Graddon, N. and Chandrasekaran, S. (2025) "Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks." arXiv. https://doi.org/10.48550/arXiv.2507.10789
  6. Luo, W. et al. (2024) "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture." arXiv. https://doi.org/10.48550/arXiv.2402.13499
  7. Luo, W. et al. (2025) "Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis." arXiv. https://doi.org/10.48550/arXiv.2501.12084
  8. Abdelkhalik, H. et al. (2022) "Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis." arXiv. https://doi.org/10.48550/arXiv.2208.11174
  9. Luhnen, T., Marschner, T. and Lal, S. "Benchmarking Thread Block Cluster - 2CTA MMA."