@getianao
Last active August 2, 2023 01:38
GPU Terminology


| Nvidia / CUDA | AMD / OpenCL | Note |
| --- | --- | --- |
| Task | Task | Kernel launches or API commands (e.g., data movement). |
| CUDA Stream | Command Queue | A stream is a sequence of commands (possibly issued by different host threads) that execute in order. |
| | Command Processor (CP) | The CP sends work packages (i.e., workgroups of work-items in HIP) to the Compute Units (CUs). |
| Kernel | Kernel | Function executed on the GPU. |
| Grid | NDRange | A kernel is executed as a grid of blocks of threads. |
| Thread Block (TB) / Cooperative Thread Array (CTA) | Workgroup (WG) | Basic workload unit assigned to an SM or CU (stays on the same SM for its whole execution). Each kernel is split into multiple CTAs, and the number of CTAs is controlled by the application. Hardware typically limits a block to 1024 threads. All threads within a block can be synchronized using `__syncthreads()`, which forces every thread in the block to wait. |
| Warp | Wavefront (WF/WV) | A group of threads (e.g., 32 for NVIDIA, 64 for AMD) executing in lockstep (i.e., running the same instruction along the same control-flow path). The warp size is the SIMD width. The number of WFs per WG is chosen by the developer. |
| Thread | Work-item | The basic element to be processed. |
| GPU Device | GPU Device | |
| GPU Processing Cluster (GPC) | Shader Engine (SE) | A collection of CUs organized into one or two SHs. |
| Texture Processing Cluster (TPC) | Shader Array (SH) | A group made up of several SMs or CUs. |
| Streaming Multiprocessor (SM) / Multiprocessor | Compute Unit (CU) | The fundamental unit of computation, replicated many times on a GPU. One SM can run multiple CUDA blocks, depending on resource usage. GV100: max 64 warps/SM, 32 TBs/SM. |
| Processing Block / Sub-partition | SIMD Unit | A group of CUDA cores (e.g., 16). Each sub-partition has its own warp scheduler and dispatch unit. Each SIMD unit has its own PC and an instruction buffer for 10 WFs. |
| Streaming Processor (SP) / CUDA Core / FPxx Core | Stream Processor / SIMD Lane / VALU Lane / ALU | A parallel execution lane within an SM or CU. |
| Shared Memory (SMEM) | Scratchpad / Local Data Share (LDS) | On-chip memory shared by all threads in a block; allocated to a WG and shared by the WFs in that WG. |
| | Global Data Share (GDS) | Used by all WFs of a kernel, across all CUs. |
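To tie the table together, here is a minimal CUDA sketch of a block-level sum reduction (kernel and variable names are illustrative, not from the gist). It exercises the grid > block (CTA) > thread hierarchy, shared memory (SMEM), and `__syncthreads()` described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread block (CTA) sums its slice of the input in shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float smem[256];               // SMEM: one slot per thread in the block
    int tid = threadIdx.x;                    // thread (work-item) index within the block
    int gid = blockIdx.x * blockDim.x + tid;  // global index within the grid (NDRange)

    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                          // all threads in the block wait here

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];            // one partial sum per block
}

int main() {
    const int n = 1 << 20, threads = 256;            // 256 threads/block = 8 warps of 32
    const int blocks = (n + threads - 1) / threads;  // grid size = number of CTAs
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, out, n);       // kernel launch: a grid of blocks
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int i = 0; i < blocks; i++) total += out[i];
    printf("sum = %.0f\n", total);                   // expect 1048576
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that `__syncthreads()` only synchronizes threads within one block; there is no cheap grid-wide barrier, which is exactly why the block (CTA) is the basic scheduling unit in the table.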