CUDA Compute Capability

NVIDIA GPU Compatibility Table with CUDA Versions

CUDA Compute Capability (CC) is a version number that identifies the hardware capabilities of an NVIDIA GPU. It determines which CUDA features are available: Tensor Cores, FP16, BF16, INT8, dynamic parallelism, and more.

Shown: 10 GPUs

#    GPU                              CC     VRAM      FP32
1    NVIDIA GeForce GTX 560 Ti 448    2.0    1 GB      1.31 TFLOPS
2    NVIDIA GeForce GTX 560 Ti        2.1    1 GB      1.26 TFLOPS
3    NVIDIA Quadro K4000              3.0    3 GB      1.24 TFLOPS
4    NVIDIA Quadro P600 Mobile        6.1    4 GB      1.24 TFLOPS
5    NVIDIA H200 SXM 141 GB           9.0    141 GB    66.91 TFLOPS
6    NVIDIA H200 NVL                  9.0    141 GB    60.32 TFLOPS
7    NVIDIA H100 SXM5 96 GB           9.0    96 GB     66.91 TFLOPS
8    NVIDIA H100 SXM5 94 GB           9.0    94 GB     66.91 TFLOPS
9    NVIDIA B200 SXM 192 GB           10.0   192 GB    62.08 TFLOPS
10   NVIDIA B100                      10.0   192 GB    62.08 TFLOPS

What is CUDA Compute Capability?

CUDA Compute Capability (CC) is a numeric identifier that defines the set of hardware features of an NVIDIA GPU. Format: major.minor, where major is the architecture generation, and minor is an incremental improvement within the generation.
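Because CC is a major.minor pair, it is best handled as a pair of integers rather than a float. A minimal sketch (the helper name is hypothetical):

```python
def parse_cc(cc: str) -> tuple[int, int]:
    """Parse a compute capability string like '8.6' into (major, minor)."""
    major, minor = cc.split(".")
    return int(major), int(minor)

# Tuples compare element-wise, so minimum-version checks work naturally:
print(parse_cc("8.6"))                      # (8, 6)
print(parse_cc("9.0") > parse_cc("8.9"))    # True
```

Comparing as tuples avoids float pitfalls: as a float, a hypothetical "7.10" would compare below 7.5, while (7, 10) > (7, 5) is correct.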

For example, CC 7.0 is Volta (Tesla V100), CC 8.0 is Ampere (A100), CC 8.9 is Ada Lovelace (RTX 4090), CC 9.0 is Hopper (H100). Each version defines which instructions and features are available on the GPU.

When Do You Need to Know Compute Capability?

Model Training. PyTorch and TensorFlow use CC to select optimal kernels. Mixed precision training (FP16/BF16) requires at least CC 7.0 (Volta). BF16 requires CC 8.0+. FP8 requires CC 8.9+.

Inference. TensorRT and vLLM optimize models for a specific CC. INT8 quantization requires CC 6.1+, INT4 (via CUTLASS) requires CC 7.5+. Flash Attention 2 requires CC 8.0+.

Compiling CUDA Code. When building with nvcc, you need to specify the target architecture: -gencode arch=compute_80,code=sm_80. Wrong architecture means the code won't run or will fall back to JIT compilation with performance loss.
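A common pattern is to emit SASS for each real target plus PTX for the newest one, so older GPUs run native code and newer GPUs can still JIT-compile. A sketch that builds such a flag list (the helper is hypothetical; the flag syntax is nvcc's standard -gencode form):

```python
def gencode_flags(ccs: list[str]) -> list[str]:
    """Build nvcc -gencode flags for a list of target CCs like ['8.0', '9.0']."""
    flags = []
    sms = [cc.replace(".", "") for cc in sorted(ccs)]
    for sm in sms:
        # Native SASS for each listed architecture
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    # PTX for the newest target, for JIT fallback on future GPUs
    flags.append(f"-gencode arch=compute_{sms[-1]},code=compute_{sms[-1]}")
    return flags

print(" ".join(gencode_flags(["8.0", "9.0"])))
```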

Choosing a GPU for Rent. If your project requires 3rd generation Tensor Cores (BF16, TF32) — you need CC 8.0+ (A100, A30). For Transformer Engine (FP8) — CC 8.9+ (L40S, RTX 4090) or CC 9.0 (H100).
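The thresholds above can be collected into a simple lookup for pre-flight checks. A sketch using the minimums stated in this section (the table and function names are hypothetical):

```python
# Minimum CC per feature, per the thresholds listed above.
MIN_CC = {
    "int8": (6, 1),               # Pascal
    "fp16_tensor": (7, 0),        # Volta
    "int4": (7, 5),               # Turing (via CUTLASS)
    "bf16": (8, 0),               # Ampere
    "flash_attention_2": (8, 0),  # Ampere
    "fp8": (8, 9),                # Ada Lovelace
}

def supports(cc: tuple[int, int], feature: str) -> bool:
    """True if a GPU with the given (major, minor) CC meets the feature's minimum."""
    return cc >= MIN_CC[feature]

print(supports((8, 6), "bf16"))  # RTX 3090 (CC 8.6): True
print(supports((7, 5), "fp8"))   # T4 (CC 7.5): False
```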

Compute Capability Versions

CC 3.x (Kepler) — basic CUDA support, dynamic parallelism. Deprecated, not supported in CUDA 12+.

CC 5.x (Maxwell) — improved power efficiency, FP32 operations. Deprecated in CUDA 12.x; support is being phased out.

CC 6.x (Pascal) — FP16 (with limitations), INT8, NVLink. P100 (CC 6.0) — first GPU with full FP16 support.

CC 7.0 (Volta) — 1st generation Tensor Cores, hardware FP16. V100 — ML standard until 2022.

CC 7.5 (Turing) — 2nd generation Tensor Cores, INT8/INT4, RT Cores. T4 — the most popular inference GPU.

CC 8.0 (Ampere, GA100) — 3rd generation Tensor Cores, BF16, TF32, Sparsity. A100 — the LLM standard.

CC 8.6 (Ampere, GA10x) — consumer Ampere: RTX 3090/3080. BF16 via Tensor Cores.

CC 8.9 (Ada Lovelace) — 4th generation Tensor Cores, FP8, DLSS 3. RTX 4090, L40S.

CC 9.0 (Hopper) — Transformer Engine, native FP8, DPX instructions. H100 — the best GPU for LLM training.

CC 10.0 (Blackwell) — 5th generation Tensor Cores, FP4, Secure AI. B200, GB200.

CC 12.0 (Blackwell, GB20x) — consumer Blackwell: RTX 5090/5080. The next architecture, Vera Rubin, has been announced by NVIDIA.
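The generations above can be condensed into a lookup table, falling back to the major version when the minor revision doesn't matter. A sketch (names are hypothetical; CC 11.x is skipped because NVIDIA did not assign it):

```python
# Architecture per CC, per the list above; "N.x" entries match any minor version.
ARCH_BY_CC = {
    "3.x": "Kepler", "5.x": "Maxwell", "6.x": "Pascal",
    "7.0": "Volta", "7.5": "Turing",
    "8.0": "Ampere (GA100)", "8.6": "Ampere (GA10x)", "8.9": "Ada Lovelace",
    "9.0": "Hopper", "10.0": "Blackwell",
}

def arch_of(cc: str) -> str:
    """Look up the exact CC first, then fall back to the major-version wildcard."""
    return ARCH_BY_CC.get(cc) or ARCH_BY_CC.get(cc.split(".")[0] + ".x", "unknown")

print(arch_of("8.9"))  # Ada Lovelace
print(arch_of("6.1"))  # Pascal (via the "6.x" wildcard)
```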

What Compute Capability Affects

CUDA Toolkit Compatibility. Each CUDA Toolkit version supports a specific range of CC. For example, CUDA 12.x supports CC 5.0+, and CUDA 11.8 is the last version supporting CC 3.5 (Kepler). If your GPU is too old, the new CUDA Toolkit won't work.
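The two support boundaries mentioned here can be sketched as a quick compatibility check. This is a deliberate simplification, assuming only the two facts above (CUDA 12.x needs CC 5.0+; CUDA 11.8 still reaches back to Kepler CC 3.5); the function name is hypothetical:

```python
def toolkit_supports(cuda_version: tuple[int, int], cc: tuple[int, int]) -> bool:
    """Rough check: can this CUDA Toolkit version target this compute capability?"""
    if cuda_version >= (12, 0):
        return cc >= (5, 0)   # CUDA 12.x dropped Kepler; CC 5.0 is the floor
    return cc >= (3, 5)       # 11.x line still supports CC 3.5 (Kepler)

print(toolkit_supports((12, 4), (3, 5)))  # Kepler on CUDA 12.x: False
print(toolkit_supports((11, 8), (3, 5)))  # Kepler on CUDA 11.8: True
```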

Available Data Types. FP16 Tensor — CC 7.0+. BF16 — CC 8.0+. TF32 — CC 8.0+. FP8 — CC 8.9+. FP4 — CC 10.0+. This directly affects training and inference speed.

Library Performance. cuDNN, cuBLAS, NCCL, Flash Attention and other libraries select optimal kernels based on CC. Code compiled for sm_80 doesn't take advantage of sm_89 or sm_90 features.

Framework Support. PyTorch 2.x recommends CC 7.0+. Some operations (Flash Attention) require CC 8.0+. vLLM recommends CC 8.0+ (Ampere and newer) for optimal LLM inference.