GPU Hardware Features: NVIDIA and AMD
A comprehensive overview of modern GPU architectures and their unique features
Posted by Vipul Sharma
on October 22, 2025
NVIDIA GPU Architectures
Here's a breakdown of NVIDIA's recent GPU architectures and their compute capabilities (a runtime query sketch follows the list):
- sm_89: Ada Lovelace (RTX 4090, 4080, 4070) → Compute Capability: 8.9
- sm_90: Hopper (H100, H200) → Compute Capability: 9.0
- sm_100: Blackwell datacenter (B200) → Compute Capability: 10.0
- sm_120: Blackwell consumer (RTX 5090, 5080, 5070) → Compute Capability: 12.0
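If you want to check which of these targets a machine actually exposes, here's a minimal sketch using the CUDA runtime API (the file name and build line are mine, e.g. `nvcc query_cc.cu -o query_cc`):

```cuda
// Minimal sketch: query the compute capability of each visible GPU with the
// CUDA runtime API and print the corresponding sm_XY target.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor map directly to the table above,
        // e.g. 8.9 -> sm_89 (Ada), 9.0 -> sm_90 (Hopper).
        printf("Device %d: %s, compute capability %d.%d (sm_%d%d)\n",
               dev, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}
```

If you're already in PyTorch, `torch.cuda.get_device_capability()` returns the same (major, minor) pair.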
AMD GPU Architectures
To be completed - currently gathering information
One interesting observation: AMD uses the term "Matrix Core" instead of NVIDIA's "Tensor Core" terminology.
Key AMD Resources
Key Hardware Features
Warp Management
- Warps - the basic 32-thread SIMT execution unit (see the shuffle sketch after this list)
- Warp groups - groups of 4 warps (128 threads), the granularity at which Hopper's wgmma instructions operate
- Warp specialization - dedicating some warps of a block to data movement and others to compute, producer/consumer style
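Warp groups and warp specialization typically show up through CUTLASS, Triton, or inline PTX rather than plain CUDA C++, but the basic warp is easy to show directly. A minimal sketch of a warp-level primitive: a 32-lane sum using __shfl_down_sync, no shared memory involved (kernel name and data are mine, for illustration):

```cuda
// One warp (32 threads) reduces 32 floats to a single sum using register
// shuffles; no shared memory and no block-wide synchronization needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_sum(const float* in, float* out) {
    float v = in[threadIdx.x];
    unsigned mask = 0xffffffffu;            // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(mask, v, offset);
    }
    if (threadIdx.x == 0) *out = v;         // lane 0 ends up with the warp-wide sum
}

int main() {
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);       // expect 32.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```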
Memory Features
- TMA (Tensor Memory Accelerator) - Hopper/Blackwell hardware for bulk asynchronous copies between global and shared memory (a simpler async-copy sketch follows this list)
- Distributed shared memory - thread block clusters on sm_90+ can access each other's shared memory directly
- Tensor memory (TMEM) - Blackwell's dedicated on-chip memory that holds accumulators for the fifth-generation tensor cores (tcgen05)
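TMA proper is driven by tensor-map descriptors and bulk-copy instructions on sm_90+, usually reached through CUTLASS/CuTe, so here's only a simpler stand-in: the Ampere-era asynchronous copy path (cooperative_groups::memcpy_async), which shows the same idea of overlapping global-to-shared transfers with compute. Kernel name and tile size are mine, for illustration:

```cuda
// Asynchronous global->shared copy of one tile, then a trivial reduction.
// This is NOT TMA itself; it's the older cp.async-style path, but the
// copy-then-wait structure is the same pattern TMA pipelines build on.
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void staged_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Kick off the copy; threads could do independent work here
    // while the transfer is in flight.
    cg::memcpy_async(block, tile, in, sizeof(float) * n);
    cg::wait(block);                        // copy now visible to the whole block

    if (block.thread_rank() == 0) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += tile[i];
        *out = s;
    }
}

int main() {
    const int n = 256;
    float h_in[n], h_out = 0.0f;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    staged_sum<<<1, 256>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);            // expect 256.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```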
Precision Support
Hardware-supported vs. simulated precision formats:
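As a small illustration of two formats that are hardware-native across Ada/Hopper/Blackwell, here's a sketch doing scalar FP16 and BF16 adds with the CUDA intrinsics. FP8/FP6/FP4 are mostly consumed through the tensor-core MMA path rather than scalar ops, and any format a chip lacks natively ends up emulated via wider types. Assumes an sm_80+ GPU and a matching build flag (e.g. `nvcc -arch=sm_80`); the kernel name is mine:

```cuda
// Scalar arithmetic in two hardware-native narrow formats: FP16 and BF16.
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

__global__ void narrow_add(float a, float b, float* out_fp16, float* out_bf16) {
    __half ha = __float2half(a), hb = __float2half(b);
    __nv_bfloat16 ba = __float2bfloat16(a), bb = __float2bfloat16(b);
    *out_fp16 = __half2float(__hadd(ha, hb));        // FP16 add
    *out_bf16 = __bfloat162float(__hadd(ba, bb));    // BF16 add (native on sm_80+)
}

int main() {
    float *d, h[2];
    cudaMalloc(&d, 2 * sizeof(float));
    narrow_add<<<1, 1>>>(1.0f / 3.0f, 2.0f / 3.0f, d, d + 1);
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("fp16: %f  bf16: %f\n", h[0], h[1]);      // both near 1.0, rounded differently
    cudaFree(d);
    return 0;
}
```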
Software Tools and Frameworks
Warp Specialization Support
- Triton and Gluon - Support for warp specialization
- PyTorch and TLX - Warp specialization support
- JAX Pallas - Blackwell matmul support
Useful Links
DSL Considerations
- CuTe DSL - Currently no AMD support
- TMA on AMD - Unclear if AMD has TMA equivalent. Possible TDA implementation: Triton PR #8333
References
- NVIDIA PTX - Warp-Level Matrix Instructions
- Hacker News Discussion
- CUDA Programming Guide - Compute Capabilities
- CUTLASS Blackwell Functionality
- Jarmusch, A., Graddon, N. and Chandrasekaran, S. (2025) "Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks." arXiv. https://doi.org/10.48550/arXiv.2507.10789
- Luo, W. et al. (2024) "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture." arXiv. https://doi.org/10.48550/arXiv.2402.13499
- Luo, W. et al. (2025) "Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis." arXiv. https://doi.org/10.48550/arXiv.2501.12084
- Abdelkhalik, H. et al. (2022) "Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis." arXiv. https://doi.org/10.48550/arXiv.2208.11174
- Luhnen, T., Marschner, T. and Lal, S. "Benchmarking Thread Block Cluster - 2CTA MMA."