Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
Muhammad Osama, D. Merrill, +2 authors, John Douglas Owens
TLDR
Stream-K is a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. It provides near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
Abstract
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method partitions an even share of the aggregate inner-loop iterations among the physical processing elements. This provides near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
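The contrast between the two decompositions can be illustrated with a small host-side sketch. It is not the authors' implementation: the function names, tile sizes, and processor count are hypothetical, and it only counts work units. It ignores the cost of combining partial results when a tile's iterations are split across processing elements.

```python
import math


def tile_based_utilization(m, n, tile_m, tile_n, num_procs):
    """Utilization when each processor computes whole output tiles in waves.

    Illustrative model only: every tile costs the same, and a partially
    filled final wave leaves some processors idle.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_procs)
    return tiles / (waves * num_procs)


def stream_k_partition(m, n, k, tile_m, tile_n, tile_k, num_procs):
    """Split the aggregate inner-loop iterations evenly among processors.

    Returns one half-open range [start, end) of global iteration indices
    per processor. Iteration i belongs to output tile i // iters_per_tile.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    iters_per_tile = math.ceil(k / tile_k)
    total_iters = tiles * iters_per_tile

    base, extra = divmod(total_iters, num_procs)
    ranges = []
    start = 0
    for proc in range(num_procs):
        # The first `extra` processors take one additional iteration.
        count = base + (1 if proc < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def stream_k_utilization(ranges):
    """Fraction of processor time spent busy, given per-processor ranges."""
    counts = [end - start for start, end in ranges]
    return sum(counts) / (len(counts) * max(counts))


# Hypothetical problem: 384 x 384 output, K = 4096, 128 x 128 x 32 tiles,
# 4 processing elements. That gives 9 output tiles of 128 iterations each.
tile_util = tile_based_utilization(384, 384, 128, 128, 4)
ranges = stream_k_partition(384, 384, 4096, 128, 128, 32, 4)
print(tile_util)                     # 0.75 (9 tiles need 3 waves of 4)
print(stream_k_utilization(ranges))  # 1.0  (1152 iterations, 288 each)
```

In this assumed configuration, nine tiles do not divide evenly over four processors, so the tile-based schedule idles three of them during the last wave. The iteration-based split gives each processor 288 iterations, at the price of tiles whose iterations straddle two processors.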
