Improving performance of triangular Matrix-Vector BLAS routines on GPUs.
Publisher
IOS Press
ISBN Number
978-1-61499-040-6
Abstract
CUBLAS is a widely used implementation of BLAS (Basic Linear Algebra Subprograms) for NVIDIA CUDA Graphics Processing Units (GPUs). The aim of this paper is to show that the performance of selected Level 2 BLAS routines for triangular matrices can be improved with optimization techniques suited to GPUs, such as the use of shared memory and coalesced memory access. We present new implementations of the routines xTRMV and xTRSV. Experiments carried out on two GPUs, a Tesla M2050 and a GeForce GTX 260, show that the new implementations are up to 500% faster than the corresponding routines from the CUBLAS library.
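For readers unfamiliar with the two routines named in the abstract, the following is a minimal sequential sketch of what they compute for a lower triangular matrix: xTRMV forms the product of a triangular matrix and a vector, and xTRSV solves a triangular system by substitution. It is only a plain-Python reference for the arithmetic, not the paper's GPU implementation and not the CUBLAS API. The function names `trmv_lower` and `trsv_lower` are hypothetical, the matrix is assumed to be a non-unit lower triangular list of rows, and the results are returned as new lists, whereas the BLAS routines overwrite the input vector.

```python
def trmv_lower(a, x):
    """Return y = A x, reading only the lower triangle of A.

    Hypothetical helper: `a` is a square list of rows, `x` a list of numbers.
    """
    n = len(x)
    # Row i of a lower triangular matrix has nonzeros only in columns 0..i.
    return [sum(a[i][j] * x[j] for j in range(i + 1)) for i in range(n)]


def trsv_lower(a, b):
    """Solve A x = b by forward substitution, reading only the lower triangle of A.

    Hypothetical helper: assumes every diagonal entry a[i][i] is nonzero.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # x[i] depends on all previously computed x[0..i-1]; this dependency
        # chain is what makes the solve harder to parallelize than the product.
        partial = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - partial) / a[i][i]
    return x
```

Because the triangular solve inverts the triangular product, applying `trsv_lower` to the output of `trmv_lower` with the same matrix should recover the original vector, up to rounding.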
Change history
Last updated: 18/02/2016 - 15:10; changed by: Piotr Gawron (gawron@iitis.pl)