Write a CUDA program that computes, in parallel, the dot product of a vector 
with each row of a matrix. The inputs are a data matrix, in a format similar to 
that of the Chi2 program, and a vector, given in separate files. The program 
should output the results of the dot products. For example, if the input is

1 2 0
1 1 0
1 2 1

and w = (2, 4, 6)

then your program should output

10
6
16

Compute the dot products in parallel in your kernel function. You will have to
transpose the data matrix in order to get coalesced memory access, so that
consecutive threads read consecutive memory addresses.
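
The following is only a minimal sketch of the idea, not a reference solution.
It assumes one thread per matrix row, float data, and the example matrix
hard-coded instead of read from files (the file format is not specified here).
The names dotRows, dataT, rows and cols are made up for illustration. The
matrix is transposed on the host so that element (row r, column j) sits at
dataT[j * rows + r]; threads r, r+1, ... then read neighboring addresses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per original matrix row (an assumption of this sketch).
// dataT holds the transposed matrix: element (row r, column j) of the
// original matrix is stored at dataT[j * rows + r].
__global__ void dotRows(const float *dataT, const float *w, float *out,
                        int rows, int cols)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;  // guard threads past the last row

    float sum = 0.0f;
    for (int j = 0; j < cols; ++j) {
        // For a fixed j, adjacent threads read adjacent addresses,
        // which is the coalesced access pattern.
        sum += dataT[j * rows + r] * w[j];
    }
    out[r] = sum;
}

int main()
{
    const int rows = 3, cols = 3;
    // Example input from the assignment text, stored row-major.
    const float data[rows * cols] = {1, 2, 0,
                                     1, 1, 0,
                                     1, 2, 1};
    const float w[cols] = {2, 4, 6};

    // Transpose on the host before copying to the device.
    float dataT[cols * rows];
    for (int r = 0; r < rows; ++r)
        for (int j = 0; j < cols; ++j)
            dataT[j * rows + r] = data[r * cols + j];

    float *dDataT = nullptr, *dW = nullptr, *dOut = nullptr;
    cudaMalloc(&dDataT, sizeof(dataT));
    cudaMalloc(&dW, sizeof(w));
    cudaMalloc(&dOut, rows * sizeof(float));
    cudaMemcpy(dDataT, dataT, sizeof(dataT), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, w, sizeof(w), cudaMemcpyHostToDevice);

    const int threads = 256;  // arbitrary block size for this sketch
    const int blocks = (rows + threads - 1) / threads;
    dotRows<<<blocks, threads>>>(dDataT, dW, dOut, rows, cols);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float out[rows];
    cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost);
    for (int r = 0; r < rows; ++r)
        printf("%g\n", out[r]);  // one dot product per line

    cudaFree(dDataT);
    cudaFree(dW);
    cudaFree(dOut);
    return 0;
}
```

Without the transpose, thread r would read data[r * cols + j], so neighboring
threads would access addresses cols elements apart and the reads would not
coalesce. A full solution would still need to read the matrix and the vector
from their files and handle arbitrary sizes.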

Submit your assignment by copying your program to your AFS course folder
/afs/cad/courses/ccs/s16/cs/732/004/<UCID>. The assignment is due on
Feb 8th, 2016.