GPU Hardware Features: NVIDIA and AMD
A comprehensive overview of modern GPU architectures and their unique features
Posted by Vipul Sharma
on October 22, 2025
NVIDIA GPU Architectures
Here's a breakdown of NVIDIA's recent GPU architectures and their compute capabilities (a runtime query sketch follows the list):
- sm_89: Ada Lovelace (RTX 4090, 4080, 4070) → Compute Capability: 8.9
- sm_90: Hopper (H100, H200) → Compute Capability: 9.0
- sm_100: Blackwell datacenter (B200) → Compute Capability: 10.0
- sm_120: Blackwell consumer (RTX 5090, 5080, 5070) → Compute Capability: 12.0
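If you want to check which of these targets a machine actually exposes, here's a minimal sketch using the CUDA runtime API (the file name and build line are mine, e.g. `nvcc query_cc.cu -o query_cc`):

```cuda
// Minimal sketch: query the compute capability of each visible GPU with the
// CUDA runtime API and print the corresponding sm_XY target.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor map directly to the table above,
        // e.g. 8.9 -> sm_89 (Ada), 9.0 -> sm_90 (Hopper).
        printf("Device %d: %s, compute capability %d.%d (sm_%d%d)\n",
               dev, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}
```

If you're already in PyTorch, `torch.cuda.get_device_capability()` returns the same (major, minor) pair.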
AMD GPU Architectures
To be completed - currently gathering information
One interesting observation: AMD uses the term "Matrix Core" instead of NVIDIA's "Tensor Core" terminology.
Key AMD Resources
Key Hardware Features
Warp Management
- Warps - the basic 32-thread SIMT execution unit (see the shuffle sketch after this list)
- Warp groups - groups of 4 warps (128 threads), the granularity at which Hopper's wgmma instructions operate
- Warp specialization - dedicating some warps of a block to data movement and others to compute, producer/consumer style
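Warp groups and warp specialization typically show up through CUTLASS, Triton, or inline PTX rather than plain CUDA C++, but the basic warp is easy to show directly. A minimal sketch of a warp-level primitive: a 32-lane sum using __shfl_down_sync, no shared memory involved (kernel name and data are mine, for illustration):

```cuda
// One warp (32 threads) reduces 32 floats to a single sum using register
// shuffles; no shared memory and no block-wide synchronization needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_sum(const float* in, float* out) {
    float v = in[threadIdx.x];
    unsigned mask = 0xffffffffu;            // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(mask, v, offset);
    }
    if (threadIdx.x == 0) *out = v;         // lane 0 ends up with the warp-wide sum
}

int main() {
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);       // expect 32.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```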
Memory Features
- TMA (Tensor Memory Accelerator) - Hopper/Blackwell hardware for bulk asynchronous copies between global and shared memory (a simpler async-copy sketch follows this list)
- Distributed shared memory - thread block clusters on sm_90+ can access each other's shared memory directly
- Tensor memory (TMEM) - Blackwell's dedicated on-chip memory that holds accumulators for the fifth-generation tensor cores (tcgen05)
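TMA proper is driven by tensor-map descriptors and bulk-copy instructions on sm_90+, usually reached through CUTLASS/CuTe, so here's only a simpler stand-in: the Ampere-era asynchronous copy path (cooperative_groups::memcpy_async), which shows the same idea of overlapping global-to-shared transfers with compute. Kernel name and tile size are mine, for illustration:

```cuda
// Asynchronous global->shared copy of one tile, then a trivial reduction.
// This is NOT TMA itself; it's the older cp.async-style path, but the
// copy-then-wait structure is the same pattern TMA pipelines build on.
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void staged_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Kick off the copy; threads could do independent work here
    // while the transfer is in flight.
    cg::memcpy_async(block, tile, in, sizeof(float) * n);
    cg::wait(block);                        // copy now visible to the whole block

    if (block.thread_rank() == 0) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += tile[i];
        *out = s;
    }
}

int main() {
    const int n = 256;
    float h_in[n], h_out = 0.0f;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    staged_sum<<<1, 256>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);            // expect 256.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```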
Precision Support
Hardware-supported vs. simulated precision formats:
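As a small illustration of two formats that are hardware-native across Ada/Hopper/Blackwell, here's a sketch doing scalar FP16 and BF16 adds with the CUDA intrinsics. FP8/FP6/FP4 are mostly consumed through the tensor-core MMA path rather than scalar ops, and any format a chip lacks natively ends up emulated via wider types. Assumes an sm_80+ GPU and a matching build flag (e.g. `nvcc -arch=sm_80`); the kernel name is mine:

```cuda
// Scalar arithmetic in two hardware-native narrow formats: FP16 and BF16.
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

__global__ void narrow_add(float a, float b, float* out_fp16, float* out_bf16) {
    __half ha = __float2half(a), hb = __float2half(b);
    __nv_bfloat16 ba = __float2bfloat16(a), bb = __float2bfloat16(b);
    *out_fp16 = __half2float(__hadd(ha, hb));        // FP16 add
    *out_bf16 = __bfloat162float(__hadd(ba, bb));    // BF16 add (native on sm_80+)
}

int main() {
    float *d, h[2];
    cudaMalloc(&d, 2 * sizeof(float));
    narrow_add<<<1, 1>>>(1.0f / 3.0f, 2.0f / 3.0f, d, d + 1);
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("fp16: %f  bf16: %f\n", h[0], h[1]);      // both near 1.0, rounded differently
    cudaFree(d);
    return 0;
}
```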
Software Tools and Frameworks
Warp Specialization Support
- Triton and Gluon - Support for warp specialization
- PyTorch and TLX - Warp specialization support
- JAX Pallas - Blackwell matmul support
Useful Links
DSL Considerations
- CuTe DSL - Currently no AMD support
- TMA on AMD - Unclear if AMD has TMA equivalent. Possible TDA implementation: Triton PR #8333
References
- NVIDIA PTX - Warp-Level Matrix Instructions
- Hacker News Discussion
- CUDA Programming Guide - Compute Capabilities
- CUTLASS Blackwell Functionality
- Jarmusch, A., Graddon, N. and Chandrasekaran, S. (2025) "Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks." arXiv. https://doi.org/10.48550/arXiv.2507.10789
- Luo, W. et al. (2024) "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture." arXiv. https://doi.org/10.48550/arXiv.2402.13499
- Luo, W. et al. (2025) "Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis." arXiv. https://doi.org/10.48550/arXiv.2501.12084
- Abdelkhalik, H. et al. (2022) "Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis." arXiv. https://doi.org/10.48550/arXiv.2208.11174
- Luhnen, T., Marschner, T. and Lal, S. "Benchmarking Thread Block Cluster - 2CTA MMA."