For testing, I ran your code using double for FTYPE, created some
artificial A and B, and compared your results to the output of my own
version of a matrix multiplication routine (column-major storage only).
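
A minimal sketch of such a correctness check follows. It is not the
instructor's actual routine: the names ref_mmul_cm and max_abs_diff, their
signatures, and the square n x n shape are assumptions made for
illustration. Only FTYPE being double and the column-major layout come
from the text above.

```c
#include <stddef.h>

typedef double FTYPE;  /* the text says double was used for FTYPE */

/* Hypothetical reference routine: C = A * B for n x n matrices stored in
   column-major order, so element (i, j) lives at index i + j * n. */
void ref_mmul_cm(size_t n, const FTYPE *A, const FTYPE *B, FTYPE *C)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            FTYPE sum = 0;
            for (size_t k = 0; k < n; k++)
                sum += A[i + k * n] * B[k + j * n];
            C[i + j * n] = sum;
        }
    }
}

/* Hypothetical helper: largest absolute elementwise difference between
   two arrays of length len. */
FTYPE max_abs_diff(size_t len, const FTYPE *X, const FTYPE *Y)
{
    FTYPE worst = 0;
    for (size_t t = 0; t < len; t++) {
        FTYPE d = X[t] - Y[t];
        if (d < 0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A submitted result would then be accepted when max_abs_diff(n * n, C_ref,
C_user) stays below a small tolerance; the tolerance is a judgment call
that depends on n and on the size of the test data.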

Below you may review the performance of your functions
  (cm stands for mmulcm and rm for mmulrm)
on two instances, with 256x256 and 512x512 matrices.
Numbers are in Mflop/s, i.e. millions of floating-point operations per
    second (the flop count of a matrix multiplication is
    n*n*n + n*n*(n-1), for mults/adds respectively).
Higher numbers are better.
The instructor's own implementation is alexg (mmulcm).
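
As a rough illustration of how such a rate can be derived, the sketch
below counts n*n*n multiplications plus n*n*(n-1) additions and divides by
an elapsed time. The function names mmul_flops and mflops_rate are
hypothetical, and how the elapsed time was actually measured is not stated
here, so it is simply passed in as seconds.

```c
/* Flop count for an n x n matrix product: n*n*n multiplications plus
   n*n*(n-1) additions. Uses double to avoid integer overflow for large n. */
double mmul_flops(double n)
{
    return n * n * n + n * n * (n - 1.0);
}

/* Rate in millions of flops per second, given an elapsed time in seconds. */
double mflops_rate(double n, double seconds)
{
    return mmul_flops(n) / seconds / 1.0e6;
}
```

For n = 256 the count is 33,488,896 flops, so a rate of 98 corresponds to
roughly 0.34 seconds per multiplication.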

             n=256            n=512
           cm    rm         cm      rm
alexg      98     -      70.00       -
user2      27    23      20.60   14.10
user3      31    33      20.90   21.06
user4      30    28      21.00   18.04
user5      32    27      20.88   18.05
user6      33    27      20.77   14.40
user7      38    36      21.03   21.19
user8      36    29      21.13   18.09

User7's implementation is marginally faster than the others.
He receives the 10 (announced) bonus points for that.