@getianao
Last active August 2, 2023 01:38
GPU Terminology


| Nvidia / CUDA | AMD / OpenCL | Note |
| --- | --- | --- |
| Task | Task | Kernel launches or API commands (e.g., data movement). |
| CUDA Stream | Command Queue | A stream is a sequence of commands (possibly issued by different host threads) that execute in order. |
| | Command Processor (CP) | The CP sends work packages (i.e., workgroups of work-items in HIP) to the Compute Units (CUs). |
| Kernel | Kernel | Function executed on the GPU. |
| Grid | NDRange | A kernel is executed as a grid of blocks of threads. |
| Thread Block (TB) / Cooperative Thread Array (CTA) | Workgroup (WG) | Basic workload unit assigned to an SM or CU (stays on the same SM for its whole execution). Each kernel is split into multiple CTAs, and the number of CTAs is controlled by the application. Hardware typically limits a block to 1024 threads. All threads within a block can be synchronized using `__syncthreads()`, which forces every thread in the block to wait. |
| Warp | Wavefront (WF/WV) | A group of threads (e.g., 32 for NVIDIA, 64 for AMD) executing in lockstep (i.e., running the same instruction along the same control-flow path). The warp size is the SIMD width. The number of WFs per WG is chosen by the developer. |
| Thread | Work-item | The basic element to be processed. |
| GPU Device | GPU Device | |
| GPU Processing Cluster (GPC) | Shader Engine (SE) | A collection of CUs organized into one or two SHs. |
| Texture Processing Cluster (TPC) | Shader Array (SH) | A group made up of several SMs or CUs. |
| Streaming Multiprocessor (SM) / Multiprocessor | Compute Unit (CU) | The fundamental unit of computation, replicated many times on a GPU. One SM can run multiple CUDA blocks, depending on resource usage. GV100: max 64 warps/SM, 32 TBs/SM. |
| Processing Block / Sub-partition | SIMD Unit | A group of CUDA cores (e.g., 16). Each sub-partition has its own warp scheduler and dispatch unit. Each SIMD unit has its own PC and an instruction buffer for 10 WFs. |
| Streaming Processor (SP) / CUDA Core / FPxx Core | Stream Processor / SIMD Lane / VALU Lane / ALU | A parallel execution lane within an SM or CU. |
| Shared Memory (SMEM) | Scratchpad / Local Data Share (LDS) | On-chip memory shared by all threads in a block; allocated to a WG and shared by the WFs in that WG. |
| | Global Data Share (GDS) | Used by all WFs of a kernel, across all CUs. |
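To tie the table together, here is a minimal CUDA sketch of a block-level sum reduction (kernel and variable names are illustrative, not from the gist). It exercises the grid > block (CTA) > thread hierarchy, shared memory (SMEM), and `__syncthreads()` described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread block (CTA) sums its slice of the input in shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float smem[256];               // SMEM: one slot per thread in the block
    int tid = threadIdx.x;                    // thread (work-item) index within the block
    int gid = blockIdx.x * blockDim.x + tid;  // global index within the grid (NDRange)

    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                          // all threads in the block wait here

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];            // one partial sum per block
}

int main() {
    const int n = 1 << 20, threads = 256;            // 256 threads/block = 8 warps of 32
    const int blocks = (n + threads - 1) / threads;  // grid size = number of CTAs
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, out, n);       // kernel launch: a grid of blocks
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int i = 0; i < blocks; i++) total += out[i];
    printf("sum = %.0f\n", total);                   // expect 1048576
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that `__syncthreads()` only synchronizes threads within one block; there is no cheap grid-wide barrier, which is exactly why the block (CTA) is the basic scheduling unit in the table.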