| Nvidia/CUDA | AMD OpenCL | Note |
|---|---|---|
| Task | Task | Kernel launches or API commands (e.g., data movement) |
| CUDA Stream | Command Queue | A stream is a sequence of commands (possibly issued by different host threads) that execute in order. |
| Command Processor | Command Processor (CP) | The CP dispatches work packages (i.e., workgroups of work-items in HIP) to the Compute Units (CUs). |
| Kernel | Kernel | Function executed on GPU |
| Grid | NDRange | A kernel is executed as a grid of blocks of threads. |
| Thread Block (TB) / Cooperative Thread Array (CTA) | Workgroup (WG) | Basic workload unit assigned to an SM or CU (it stays on the same SM for its entire execution). Each kernel is split into multiple CTAs, and the number of CTAs is controlled by the application; hardware typically limits a block to 1024 threads. All threads within a block can be synchronized using `__syncthreads()`, which forces every thread in the block to wait at the barrier. |
| Warp | Wavefront (WF/WV) | A group of threads (e.g., 32 for NV, 64 for AMD) executing in lockstep (i.e., running the same instruction and following the same control-flow path). The warp size represents the SIMD width. The number of WFs per WG is chosen by the developer. |
| Thread | Work-item | A basic element to be processed. |
| GPU Device | GPU Device | |
| GPU Processing Cluster (GPC) | Shader Engine (SE) | A collection of CUs organized into one or two Shader Arrays (SHs). |
| Texture Processing Cluster (TPC) | Shader Array (SH) | A group made up of several SMs or CUs. |
| Stream Multiprocessor (SM) / Multiprocessor | Compute Unit (CU) | Fundamental unit of computation, replicated multiple times on a GPU. One SM can run multiple CUDA blocks concurrently, depending on resource availability. GV100: max 64 warps/SM, 32 TBs/SM. |
| Processing Block / Sub-partition | SIMD Unit | A group of CUDA cores (e.g., 16). Each sub-partition has its own warp scheduler and dispatch unit. Each SIMD unit has its own program counter (PC) and an instruction buffer holding 10 WFs. |
| Stream Processor (SP) / CUDA Core / FPxx Core | Stream Processor / SIMD Lane / VALU Lane / ALU | A parallel execution lane within an SM or CU. |
| Shared Memory (SMEM) | Scratchpad / Local Data Share (LDS) | Shared by all threads in a block. On AMD, the LDS is allocated to a WG and shared by the WFs in that WG. |
| — | Global Data Share (GDS) | Shared by all WFs of a kernel across all CUs. |
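To make the execution-model rows above concrete, here is a minimal CUDA sketch (HIP is largely source-compatible) that launches a grid of thread blocks (CTAs/workgroups), uses per-block shared memory (LDS), and synchronizes with `__syncthreads()`. The kernel and variable names are illustrative, not from any particular codebase.

```cuda
#include <cstdio>

// Each thread block (CTA / workgroup) reduces 256 elements in shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float smem[256];      // SMEM/LDS: visible to one block only
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                 // barrier: all threads in the block wait

    // Tree reduction; threads in the same warp/wavefront run in lockstep.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];   // one partial sum per block
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // <<<grid, block>>>: the grid (NDRange) is split into CTAs (workgroups),
    // each assigned to one SM/CU for its whole lifetime.
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("first block partial sum: %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With HIP, the same kernel compiles unchanged; only the host API calls (`hipMallocManaged`, `hipLaunchKernelGGL`, etc.) differ.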
Last active: August 2, 2023 01:38
GPU Terminology