Structure:
Makefile	:	Program makefile
general.c  	:	General Matrix multiplication routine which
			is memory inefficient.
			(see below for details)
matrix.h	:	Header file
			(redundant header information is included)
mult1.c		:	Main program to test general_mult.c
README		:	This file

------------------------------

Parallel matrix multiplication routines on the BSP model
Files needed:
1. Makefile
	% make all
   to build everything or
	% make mult1
   to build general_mult.c and mult1.c
2. mult1.c

	Main file that invokes the memory-inefficient matrix
	multiplication. The main() function works for
	any processor geometry p=piXpj (i.e., for any pi and pj).
	This file is otherwise similar to mult1.c.

	Matrices are block distributed among the processors. The main()
	function initializes relevant matrices on the fly.
	
	Once it is built (make all or make mult1), invoke
	% mult1
	with no arguments to print its parameters:
	mult1 dimension nprocs pi pj debug runs
	For simplicity, both pi and pj must divide dimension.

	Two dimension X dimension matrices are created on the fly and
	distributed according to a block distribution.
	nprocs  = pi X pj, the number of processors to be used
	pi      y dimension (column size) of the processor geometry
	pj      x dimension (row size) of the processor geometry
	        For example, pi=3 pj=2 nprocs=6 gives the processor
	        allocation
	          0  3
	          1  4
	          2  5
	debug   0 (false)
	        1 (true): prints matrices A, B, AXB
	runs    Repeat the loop _runs_ times.
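
The argument constraints described above (nprocs = pi X pj, with both pi
and pj dividing dimension) and the column-by-column processor numbering of
the example grid can be sketched as follows. This is only an illustration:
check_args, proc_id and print_grid are hypothetical names and are not
functions taken from mult1.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of mult1.c): returns 1 when the
 * arguments satisfy the constraints listed in this README, 0 otherwise. */
int check_args(int dimension, int nprocs, int pi, int pj)
{
    if (dimension <= 0 || pi <= 0 || pj <= 0)
        return 0;
    if (nprocs != pi * pj)                 /* nprocs = pi X pj */
        return 0;
    if (dimension % pi != 0 || dimension % pj != 0)
        return 0;                          /* pi and pj must divide dimension */
    return 1;
}

/* Processor id at grid row r (0..pi-1) and grid column c (0..pj-1),
 * numbering processors down each column as in the example grid. */
int proc_id(int r, int c, int pi)
{
    assert(r >= 0 && r < pi && c >= 0);
    return c * pi + r;
}

/* Print the processor grid, one grid row per line. */
void print_grid(int pi, int pj)
{
    for (int r = 0; r < pi; r++) {
        for (int c = 0; c < pj; c++)
            printf("%3d", proc_id(r, c, pi));
        printf("\n");
    }
}
```

Calling print_grid(3, 2) prints the three rows of the example grid
(0 3, 1 4, 2 5).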

3. general.c
	The memory-inefficient BSP matrix multiplication algorithm.
        It implements a memory-inefficient matrix multiplication routine
        on the BSP model proposed by Valiant (Comm. ACM 33(8), pp. 103-111,
        Aug. 1990).
        Initially, two global $n \times n$ matrices $A_g$ and $B_g$ are
        distributed among $p$ processors. The $p$ processors are divided
        into $pj$ groups of $pi$ processors each. Each processor holds a
        block of $n/pi$ rows and $n/pj$ columns, so element $(i,j)$ of $A_g$
        or $B_g$ is stored in processor $i/(n/pi)$ of processor group
        $j/(n/pj)$ (integer division), that is, processor
        $(j/(n/pj))*pi + i/(n/pi)$.
        Function {\tt multiply\_par} requires six arguments: the two input
        matrices $A$ and $B$ (distributed among the processors), each
        of dimension (n/pi X n/pj), the result matrix $C$ (already
        allocated with malloc) of dimension (n/pi X n/pj), and the
        values n, pi and pj.
        $A$, $B$ and $C$ are ANSI C pointers to {\tt double}.
        We store matrices in the form of a one-dimensional array. This way,
        element $(i,j)$ of a two-dimensional matrix is stored in position
        $j*n+i$ of, say, $A$. All indices are in the range $0, \ldots, n-1$.
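
The block distribution and the one-dimensional storage can be illustrated
with the short sketch below. It assumes that each processor holds a block of
n/pi rows and n/pj columns (as implied by the local matrix dimensions) and
that processors are numbered group by group. The names owner and pos are
hypothetical and do not come from general.c.

```c
#include <assert.h>

/* Hypothetical helper: processor that owns global element (i,j) when
 * each block has n/pi rows and n/pj columns (an assumption based on the
 * local matrix dimensions), with pi processors per group. */
int owner(int i, int j, int n, int pi, int pj)
{
    assert(pi > 0 && pj > 0 && n % pi == 0 && n % pj == 0);
    assert(i >= 0 && i < n && j >= 0 && j < n);
    int rows = n / pi;              /* rows per block    */
    int cols = n / pj;              /* columns per block */
    return (j / cols) * pi + (i / rows);
}

/* Hypothetical helper: position of element (i,j) in a one-dimensional
 * array that stores a matrix with `rows` rows column by column.
 * With rows == n this is the position j*n+i given in the text. */
int pos(int i, int j, int rows)
{
    return j * rows + i;
}
```

For n=6, pi=3 and pj=2, owner(5, 5, 6, 3, 2) is processor 5 and
owner(0, 3, 6, 3, 2) is processor 3, the first processor of the second
group.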

 	In a single superstep, each processor reads all the blocks it needs
 	for the matrix multiplication, and then performs the multiplication.
  	Memory per processor is 4n^2/p + 2n^2/q, where q=min{pi,pj},
  	instead of 3n^2/p.
  	Matrix A is transposed prior to communication and computation; this
  	may improve efficiency through better cache locality.
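
The effect of transposing A can be sketched as follows, assuming the
column-by-column storage described earlier. With A transposed, a row of A
and a column of B are both contiguous in memory, so the inner loop walks
both arrays with stride 1. This is a serial illustration of the local
computation only: transpose and multiply_at are hypothetical names, not
the routines in general.c, and all BSP communication is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Return a newly allocated transpose of the rows-by-cols matrix a
 * (stored column by column), or NULL if allocation fails.
 * The caller frees the result. */
double *transpose(const double *a, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    double *t = malloc((size_t)rows * (size_t)cols * sizeof *t);
    if (t == NULL)
        return NULL;
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            t[(size_t)i * cols + j] = a[(size_t)j * rows + i];
    return t;
}

/* C = A*B, where A is m-by-k, B is k-by-n and C is m-by-n, all stored
 * column by column. A is passed as its transpose `at` (k-by-m), so
 * row i of A is the contiguous column i of `at`. */
void multiply_at(const double *at, const double *b, double *c,
                 int m, int k, int n)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)   /* stride-1 access in both arrays */
                sum += at[(size_t)i * k + l] * b[(size_t)j * k + l];
            c[(size_t)j * m + i] = sum;
        }
    }
}
```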


4. Timing results (What you should expect)
   For _multiply_par in multiply_par.c.
   Times are in seconds; Mfl is Mflop/s per processor.

   SGI Power Challenge
               p=1                  p=4
   n= 512     2.1sec (127Mfl)      0.60sec (110Mfl)
     1024    62.4sec  (34Mfl)      4.57sec (117Mfl)

   IBM SP2
               p=1                  p=4
   n= 512     5.7sec  (46Mfl)      1.50sec  (42Mfl)
     1024    45.3sec  (47Mfl)      11.8sec  (45Mfl)

   Cray T3D
               p=4                  p=16                 p=64
   n=1024   41.75sec (12.8Mfl)    9.48sec (14.14Mfl)   2.35sec (14.27Mfl)
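
The Mfl figures in the tables above are consistent with a rate in Mflop/s
per processor, counting 2n^3 floating-point operations for an n X n
multiplication. That operation count is an assumption inferred from the
numbers, not something stated in the source code. The hypothetical helper
below reproduces the calculation so that your own timings can be compared.

```c
#include <assert.h>

/* Hypothetical helper: Mflop/s per processor for an n-by-n matrix
 * multiplication that takes `seconds` on p processors, assuming
 * 2*n^3 floating-point operations in total. */
double mflops_per_proc(int n, int p, double seconds)
{
    assert(n > 0 && p > 0 && seconds > 0.0);
    double flops = 2.0 * n * n * n;   /* evaluated in double, no overflow */
    return flops / seconds / p / 1.0e6;
}
```

For n=512, p=1 and 2.1 sec this gives roughly 127.8, in line with the
127Mfl entry for the SGI Power Challenge.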
