The game
We provide a CUDA program, mytoy.cu, that is simple yet complete enough to let you change a variety of features in your software and quickly see how they affect performance on the underlying GPU architecture.
The kernel performs operations in parallel on every single element of a sparse matrix. It is about the shortest irregular program you can write, and it lets you easily benefit from the new capabilities Kepler is endowed with (such as the SMX multiprocessors or Hyper-Q), along with other mechanisms that already existed in CUDA. As a CUDA programmer, you have to select the parameters that maximize performance and, more importantly, decide how to deploy the parallelization strategy. Depending on your programming skills and the GPU hardware you have, you can choose among several levels of difficulty:
- Basic: Play with warps, threads and blocks.
- Intermediate: Change the kernel launches.
- Advanced: Assign kernels to streams the right way.
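
To make the kernel described above more concrete, here is a minimal sketch of what "operating in parallel on every element of a sparse matrix" can look like. It is not the code of mytoy.cu: the names `value_t`, `APPLY_OP`, `updateNonzeros` and `runOnGpu` are hypothetical, and the sketch assumes the nonzeros are stored in a flat array with one thread assigned to each of them.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the data type and operation choices.
typedef double value_t;                  // could be int, float or double
#define APPLY_OP(x, y) ((x) + (y))       // could be +, * or /

// One thread per stored nonzero: apply the operation `ops` times.
__global__ void updateNonzeros(value_t *vals, int nnz, int ops, value_t operand)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz) {                       // guard the last, partially filled block
        value_t v = vals[i];
        for (int k = 0; k < ops; ++k)
            v = APPLY_OP(v, operand);
        vals[i] = v;
    }
}

// Host wrapper: copy the nonzeros to the GPU, run the kernel, copy them back.
// Returns 0 on success and 1 on any CUDA error.
int runOnGpu(std::vector<value_t> &vals, int ops, value_t operand, int threadsPerBlock)
{
    int nnz = (int)vals.size();
    if (nnz == 0) return 0;              // nothing to do, avoid an empty launch
    size_t bytes = nnz * sizeof(value_t);
    value_t *d_vals = nullptr;
    if (cudaMalloc(&d_vals, bytes) != cudaSuccess) return 1;

    cudaError_t err = cudaMemcpy(d_vals, vals.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int blocks = (nnz + threadsPerBlock - 1) / threadsPerBlock;  // round up
        updateNonzeros<<<blocks, threadsPerBlock>>>(d_vals, nnz, ops, operand);
        err = cudaGetLastError();        // catches invalid launch configurations
    }
    if (err == cudaSuccess)              // this copy also waits for the kernel
        err = cudaMemcpy(vals.data(), d_vals, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling `runOnGpu(vals, 4, 1.0, 1024)` on a vector of nonzeros would add 1.0 to each of them four times. Changing the `typedef`, the macro or the block size is the kind of experiment the basic level is about.
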
Points of interest are tagged in the code with the “MU” label followed by a number, according to the following table:
Tag | Description/purpose | Choices | Investigate |
---|---|---|---|
MU1 | Selects data type | int/float/double (initially: double) | ALU/FPU performance |
MU2 | Selects type of operation | add/mul/div (initially: add) | Arithmetic latency |
MU3 | Calculates number of blocks and their size | 32, 64, 128, 192, 256, 384, 512, 1024 (initially: 1024) | Parallel deployment |
MU4 | Declares streams | All kernels in the same stream or one stream for each kernel (initial choice) | Parallel deployment |
MU5 | Launches kernels on streams | This is tightly coupled with MU4 | Parallel deployment |
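
The parallel deployment choices in the table (block size, stream declaration and kernel launch) can be sketched as follows. This is an illustration under stated assumptions, not the code of mytoy.cu: the kernel `addOne` and the helpers `blocksFor` and `launchChunks` are hypothetical names, and the data is assumed to be split into equally sized chunks, one per kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one in mytoy.cu.
__global__ void addOne(double *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0;
}

// Block count: number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Launches numKernels independent kernels, each on its own chunk of d_data.
// oneStreamPerKernel == true  -> one stream per kernel (kernels may overlap)
// oneStreamPerKernel == false -> all kernels share one stream (they serialize)
// Returns 0 on success and 1 on any CUDA error.
int launchChunks(double *d_data, int chunkLen, int numKernels,
                 int threadsPerBlock, bool oneStreamPerKernel)
{
    int numStreams = oneStreamPerKernel ? numKernels : 1;
    std::vector<cudaStream_t> streams(numStreams);
    for (int s = 0; s < numStreams; ++s)            // declare the streams
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return 1;

    int blocks = blocksFor(chunkLen, threadsPerBlock);
    for (int k = 0; k < numKernels; ++k) {          // launch kernels on streams
        cudaStream_t st = streams[oneStreamPerKernel ? k : 0];
        addOne<<<blocks, threadsPerBlock, 0, st>>>(d_data + k * chunkLen, chunkLen);
    }

    cudaError_t err = cudaGetLastError();           // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    return err == cudaSuccess ? 0 : 1;
}
```

Kernels issued to the same stream run one after another, whereas kernels in different streams are allowed to run concurrently when the hardware has free resources. Timing both settings of `oneStreamPerKernel` for several block sizes is one way to see how much concurrency a given GPU actually delivers.
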

The program takes four command-line arguments:

Position | Meaning | Comments/hints |
---|---|---|
First | The ID of the GPU used for running the code | Usually 0 (particularly when using cloud computing) |
Second | The number of operations performed on each matrix nonzero | Affects operational intensity. Useful for obtaining all the coordinates needed to draw the roofline model for the target GPU |
Third | The file name for the input sparse matrix (you can find them in the sparsematrices directory) | sparsematrices/samplematrix.rua is a sample |
Fourth | The file name to write the output results | Useful to validate GPU computations |
An example command for running the program from the shell (the command itself is in boldface):
/home/ujaldon> **./mytoy 0 4 sparsematrices/samplematrix.rua myoutputfile.txt**
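
For readers who want to see how a command line like this one maps onto a program, here is a minimal sketch of reading the four positional arguments and selecting the GPU. It does not reproduce the parsing in mytoy.cu: the `Args` struct and the functions `parseArgs` and `selectGpu` are hypothetical, and the sketch assumes the first argument is a zero-based device ID.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical container for the four positional arguments.
struct Args {
    int gpuId;                // first: which GPU to use
    int opsPerNonzero;        // second: operations performed on each nonzero
    const char *matrixFile;   // third: input sparse matrix file
    const char *outputFile;   // fourth: file for the output results
};

// Fills `a` and returns true when exactly four arguments follow the program
// name and both numeric arguments are non-negative.
bool parseArgs(int argc, char **argv, Args *a)
{
    if (argc != 5) return false;
    a->gpuId = atoi(argv[1]);
    a->opsPerNonzero = atoi(argv[2]);
    a->matrixFile = argv[3];
    a->outputFile = argv[4];
    return a->gpuId >= 0 && a->opsPerNonzero >= 0;
}

// Makes the requested GPU current; returns false if no such device exists.
bool selectGpu(int gpuId)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return false;
    if (gpuId < 0 || gpuId >= count) return false;
    return cudaSetDevice(gpuId) == cudaSuccess;
}
```

With the example command above, `parseArgs` would yield GPU 0, four operations per nonzero, `sparsematrices/samplematrix.rua` as input and `myoutputfile.txt` as output.
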