The game

We provide a CUDA program, mytoy.cu, that is simple yet complete enough to let you change a variety of features in your code and quickly see how they affect performance on the underlying GPU architecture.

The kernel performs operations in parallel on every nonzero element of a sparse matrix. It is about the shortest irregular program you can write, and it lets you easily benefit from the new capabilities of Kepler (such as the SMX features or Hyper-Q), along with other mechanisms already available in CUDA. As a CUDA programmer, you have to select the parameters that maximize performance and, more importantly, decide how to deploy the parallelization strategy. Depending on your programming skills and the GPU hardware you have, you can choose among several levels of difficulty:

  • Basic: Play with warps, threads and blocks.
  • Intermediate: Change the kernel launches.
  • Advanced: Assign kernels to streams the right way.

Points of interest are tagged in the code with the “MU” label followed by a number, according to the following table:

Control points (within mytoy.cu)
| Tag | Description/purpose | Choices | Investigate |
|-----|---------------------|---------|-------------|
| MU1 | Selects data type | int/float/double (initially: double) | ALU/FPU performance |
| MU2 | Selects type of operation | add/mul/div (initially: add) | Arithmetic latency |
| MU3 | Calculates number of blocks and their size | 32, 64, 128, 192, 256, 384, 512, 1024 (initially: 1024) | Parallel deployment |
| MU4 | Declares streams | All kernels in the same stream, or one stream for each kernel (initial choice) | Parallel deployment |
| MU5 | Launches kernels on streams | This is tightly coupled with MU4 | Parallel deployment |
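To make the control points more concrete, here is a minimal sketch of how they might look in a small CUDA program. It is not the actual content of mytoy.cu: the names value_t, apply_op, blocks_for and toy_kernel, the problem size, and the number of kernels are all hypothetical, chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// MU1 (assumed form): element type, one of int / float / double.
typedef double value_t;

// MU2 (assumed form): operation applied to each nonzero (add / mul / div).
__host__ __device__ inline value_t apply_op(value_t a, value_t b) {
    return a + b;
}

// MU3: blocks needed to cover n elements with the chosen block size.
inline int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical kernel: one thread per nonzero, repeating the operation
// `reps` times on that element.
__global__ void toy_kernel(value_t *vals, int n, int reps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        value_t x = vals[i];
        for (int r = 0; r < reps; ++r) {
            x = apply_op(x, (value_t)1);
        }
        vals[i] = x;
    }
}

int main() {
    const int n = 1 << 20;       // hypothetical number of nonzeros
    const int reps = 4;          // operations per nonzero
    const int threads = 1024;    // MU3: block size (initial choice)
    const int num_kernels = 4;   // hypothetical number of kernels
    const int chunk = n / num_kernels;

    value_t *d_vals = NULL;
    if (cudaMalloc((void **)&d_vals, n * sizeof(value_t)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_vals, 0, n * sizeof(value_t));

    // MU4: one stream per kernel (the initial choice in the table).
    cudaStream_t streams[num_kernels];
    for (int k = 0; k < num_kernels; ++k) {
        cudaStreamCreate(&streams[k]);
    }

    // MU5: each kernel works on its own chunk, in its own stream.
    for (int k = 0; k < num_kernels; ++k) {
        toy_kernel<<<blocks_for(chunk, threads), threads, 0, streams[k]>>>(
            d_vals + k * chunk, chunk, reps);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int k = 0; k < num_kernels; ++k) {
        cudaStreamDestroy(streams[k]);
    }
    cudaFree(d_vals);
    return err == cudaSuccess ? 0 : 1;
}
```

In this sketch, the single-stream alternative of MU4 would amount to passing the same stream to every launch, which serializes the kernels instead of letting the hardware overlap them.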

 

Input parameters to mytoy.cu (provided from the Linux shell)
| Position | Meaning | Comments/hints |
|----------|---------|----------------|
| First | The number (device ID) of the GPU used to run the code | Usually 0 (particularly when using cloud computing) |
| Second | The number of operations performed on each matrix nonzero | Affects operational intensity. Useful to obtain all the coordinates needed to draw the roofline model for the target GPU |
| Third | The file name of the input sparse matrix (you can find the matrices in the sparsematrices directory) | sparsematrices/samplematrix.rua is a sample |
| Fourth | The file name to write the output results to | Useful to validate GPU computations |
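The sketch below illustrates how the four positional arguments could be read and how the second one relates to operational intensity. It is an assumption-laden illustration, not the parsing code of mytoy.cu: the helper operational_intensity is hypothetical, and it uses a simplified traffic model in which each nonzero is read once and written once and the index arrays of the sparse format are ignored.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper. Assumed traffic model: every nonzero is read once
// and written once; index arrays of the sparse format are ignored.
static double operational_intensity(int ops_per_nonzero, size_t bytes_per_value) {
    return (double)ops_per_nonzero / (2.0 * (double)bytes_per_value);
}

int main(int argc, char **argv) {
    if (argc != 5) {
        fprintf(stderr, "usage: %s <gpu> <ops> <matrix file> <output file>\n", argv[0]);
        return 1;
    }
    int gpu = atoi(argv[1]);            // first: GPU device ID, usually 0
    int ops = atoi(argv[2]);            // second: operations per nonzero
    const char *matrix_file = argv[3];  // third: input sparse matrix
    const char *output_file = argv[4];  // fourth: output results

    // Select the requested device; this fails if the ID does not exist.
    cudaError_t err = cudaSetDevice(gpu);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d): %s\n", gpu, cudaGetErrorString(err));
        return 1;
    }

    printf("matrix: %s, output: %s\n", matrix_file, output_file);
    printf("operations per byte (double): %g\n",
           operational_intensity(ops, sizeof(double)));
    return 0;
}
```

Under this traffic model, 4 operations per nonzero on double values give 4 / 16 = 0.25 operations per byte. Sweeping the second argument moves the kernel along the horizontal axis of the roofline plot.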

 

An example command for running the program from the shell (type the part after the prompt):

/home/ujaldon> ./mytoy 0 4 sparsematrices/samplematrix.rua myoutputfile.txt