Using the template files mm_mpi.cpp, mm_cuda.cu, and Makefile, write a tiled matrix multiply program in MPI and CUDA. The two source files, mm_mpi.cpp and mm_cuda.cu, are taken from the working version, except that some functions have been deleted for you to fill in. The files compile as given, and the results even pass the test that compares the host version with the CUDA version, but only because both are initialized to zeros :-). So don't be fooled. You need to fill in the missing lines and functions as indicated in the files.
If you are not sure, ask. Or come to the office hours. See the class web page for the Spring 2025 office hours.
The master randomly generates both mat A and mat B. Assuming the matrix dimension n is a power of 2, mat A is distributed equally among the nprocs processes, while mat B is broadcast to all of them.
For each process, perform matrix multiply on n/nprocs rows.
The master collects n/nprocs rows from each of the nprocs processes.
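The distribute, compute, and collect steps above can be sketched on one machine, without MPI, by looping over the ranks. This is only an illustration, not the code missing from the template files. The names multiply_rows_tiled and simulate_distributed_multiply, the TILE value of 16, and the row-major float layout are all assumptions, and nprocs is assumed to divide n evenly. In the real program the slice of A would typically arrive through a call such as MPI_Scatter, B through MPI_Bcast, and the result rows would go back through MPI_Gather. The blocked host loop stands in for the CUDA kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile width; the template files may use a different value.
constexpr std::size_t TILE = 16;

// Multiply `rows` rows of A (rows x n, row-major) by B (n x n, row-major)
// and store the product in C (rows x n). The loops are blocked into
// TILE x TILE tiles, the same decomposition a shared-memory CUDA kernel uses.
void multiply_rows_tiled(const float* A, const float* B, float* C,
                         std::size_t rows, std::size_t n) {
    std::fill(C, C + rows * n, 0.0f);
    for (std::size_t i0 = 0; i0 < rows; i0 += TILE)
        for (std::size_t k0 = 0; k0 < n; k0 += TILE)
            for (std::size_t j0 = 0; j0 < n; j0 += TILE) {
                const std::size_t i1 = std::min(i0 + TILE, rows);
                const std::size_t k1 = std::min(k0 + TILE, n);
                const std::size_t j1 = std::min(j0 + TILE, n);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}

// Simulate the data flow with a loop over ranks instead of real processes.
// Each rank owns n/nprocs consecutive rows of A and produces the same rows of C.
std::vector<float> simulate_distributed_multiply(const std::vector<float>& A,
                                                 const std::vector<float>& B,
                                                 std::size_t n, std::size_t nprocs) {
    assert(nprocs > 0 && n % nprocs == 0);  // assumption: equal row blocks
    const std::size_t rows = n / nprocs;
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t rank = 0; rank < nprocs; ++rank) {
        // First element of this rank's slice of A (what a scatter would deliver)
        // and of its slice of C (where a gather would place the result).
        const std::size_t offset = rank * rows * n;
        multiply_rows_tiled(A.data() + offset, B.data(), C.data() + offset, rows, n);
    }
    return C;
}
```

Because every rank reads all of B but only its own rows of A, the offset rank * (n/nprocs) * n is the only per-rank bookkeeping needed in this layout.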
Compare the MPI+CUDA version with the host version.
They must match. Otherwise, it is all futile: you wasted a lot of electrons. Do it again until the two match.
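One way to check the match is sketched below. It is an assumption about how such a check could look, not the comparison already in the template files: results_match, has_nonzero, and the tolerance of 1e-3 are hypothetical. A tolerance is used because host and GPU floating-point sums can round differently, so bitwise equality may be too strict. The has_nonzero helper guards against the all-zeros case mentioned at the top, where two empty results trivially agree.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when every element of `device` is within `tol` of the matching element
// of `host`, measured relative to the host value (or absolutely when the host
// value is smaller than 1).
bool results_match(const std::vector<float>& host, const std::vector<float>& device,
                   float tol = 1e-3f) {
    assert(host.size() == device.size());
    for (std::size_t i = 0; i < host.size(); ++i) {
        const float diff = std::fabs(host[i] - device[i]);
        const float scale = std::fmax(1.0f, std::fabs(host[i]));
        if (!(diff <= tol * scale)) return false;  // written this way so NaN also fails
    }
    return true;
}

// True when at least one element is nonzero. Two all-zero results "match"
// without any multiplication having been done.
bool has_nonzero(const std::vector<float>& v) {
    for (float x : v)
        if (x != 0.0f) return true;
    return false;
}
```

A run would count as passing only when has_nonzero holds for the host result and results_match holds for the pair.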