How to play
As a starting point, the mytoy.cu program provides a CUDA implementation based on the following strategy:
- Each kernel handles the computation required by one matrix column independently, and each thread is assigned a single matrix element, so all operations run logically in parallel (physically in parallel only if there are enough hardware resources).
- The number of blocks per kernel is minimized.
- The number of threads per block is maximized.
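A minimal sketch of this launch strategy is shown below. The kernel name, the per-element operation, and the column-major layout are assumptions for illustration only, not the actual code in mytoy.cu:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-column kernel: one thread per matrix element.
__global__ void column_kernel(float *column, int rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows)
        column[row] = column[row] * 2.0f + 1.0f;   // placeholder per-element work
}

void launch_per_column(float *d_matrix, int rows, int cols)
{
    const int threadsPerBlock = 1024;   // maximize threads per block (1024 on current GPUs)
    const int blocks = (rows + threadsPerBlock - 1) / threadsPerBlock;   // minimize blocks

    // One kernel launch per column; columns assumed to be stored contiguously.
    for (int c = 0; c < cols; ++c)
        column_kernel<<<blocks, threadsPerBlock>>>(d_matrix + (size_t)c * rows, rows);
    cudaDeviceSynchronize();
}
```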
Within mytoy.cu, you will find comments explaining how to convert the program from deploying one stream per column to deploying a single stream for the whole matrix. Think about which parallelization mechanism better fits the input matrix you are using.
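The sketch below contrasts the two stream deployments under the same assumptions as the previous sketch (it reuses the hypothetical column_kernel defined there); it is not the code from mytoy.cu:

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void column_kernel(float *column, int rows);   // per-element kernel from the previous sketch

// (a) One stream per column: kernels for different columns may overlap on the device.
void one_stream_per_column(float *d_matrix, int rows, int cols, int threads)
{
    int blocks = (rows + threads - 1) / threads;
    std::vector<cudaStream_t> streams(cols);
    for (int c = 0; c < cols; ++c) {
        cudaStreamCreate(&streams[c]);
        column_kernel<<<blocks, threads, 0, streams[c]>>>(d_matrix + (size_t)c * rows, rows);
    }
    cudaDeviceSynchronize();
    for (int c = 0; c < cols; ++c)
        cudaStreamDestroy(streams[c]);
}

// (b) One stream for the whole matrix: the column kernels are serialized in a single stream.
void one_stream_whole_matrix(float *d_matrix, int rows, int cols, int threads)
{
    int blocks = (rows + threads - 1) / threads;
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int c = 0; c < cols; ++c)
        column_kernel<<<blocks, threads, 0, s>>>(d_matrix + (size_t)c * rows, rows);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}
```

Whether (a) pays off depends on the matrix shape: many short columns leave room for kernels to overlap, while a few long columns already saturate the GPU on their own.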
Some ideas for the programmer to play with in this code:
- Increase granularity: assign several matrix elements to a single thread (see the grid-stride sketch after this list). Measure performance and study the positive/negative effects compared to the original way of deploying parallelism.
- Increase the number of blocks. Analyze the performance obtained with respect to the starting point.
- Reduce the number of threads per block. Analyze the GFLOPS obtained.
- Draw the entire chart that characterizes the roofline model for the GPU: attainable performance is bounded by min(peak GFLOPS, memory bandwidth × operational intensity), with operational intensity (FLOP/byte) on the horizontal axis. To cover that axis, play with the second input parameter of the mytoy.cu program, sweeping power-of-two values such as 1, 2, 4, 8, 16, 32 and 64, and interpolating points in between. Discuss the shape of this chart and extract valuable conclusions about bottlenecks and possible architectural enhancements in future GPU models.
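For the granularity idea, one common way to assign several elements to a thread is a grid-stride loop. The sketch below is a hypothetical variant of the per-column kernel from the earlier sketches; the kernel name and per-element operation are assumptions, not mytoy.cu code:

```cuda
// Hypothetical grid-stride variant: each thread processes several elements,
// separated by the total number of threads in the grid, so the launch can use
// fewer blocks and/or fewer threads per block than there are elements.
__global__ void column_kernel_strided(float *column, int rows)
{
    int stride = gridDim.x * blockDim.x;
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < rows; row += stride)
        column[row] = column[row] * 2.0f + 1.0f;   // placeholder per-element work
}
```

Launching this kernel with, for example, <<<32, 256>>> lets you vary the number of blocks, the threads per block, and the elements per thread independently, which covers the first three ideas in the list; measure the GFLOPS of each configuration against the original one-element-per-thread version.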
But this is just the beginning. There are many more ideas to investigate and implement in mytoy.cu. Show us your skills and good luck with CUDA!