/****   Copyright (c) 1997,2001-2003  Alexandros V. Gerbessiotis
 ****
 ****   Permission to use, copy, modify, and distribute this software,
 ****   and to incorporate it, in whole or in part, into other software,
 ****   is hereby granted without fee, provided that
 ****     (1) the above copyright notice and this permission notice appear in
 ****         all copies of the source code, and the above copyright notice
 ****         appear in clearly visible form on all supporting documentation
 ****         and distribution media, and
 ****     (2) any redistribution of the software, in original
 ****         or modified form, be without fee and subject to
 ****         these same conditions.
 ****   No guarantee is offered that the code works as
 ****   advertised or is absent of any, even damaging,
 ****   side-effects; use this code on your own personal risk
 ****/

__Before begin reading this file__ 
  __**BE WARNED**__

 ****   No guarantee is offered that the code works as
 ****   advertised or is absent of any, even damaging,
 ****   side-effects; use this code on your own personal risk

Section A includes directions on how to compile  test programs
that use some of the routines (C functions) for
broadcasting/parallel prefix. 

Section B includes a short description of the related functions.
More details are available before the source code of each function.

Section C describes file dependencies

Section D MPI vs BSPlib

Section E Some comments; Linking BSPlib options
**********
Section A.
==========

INSTRUCTIONS to edit/use main program for parallel prefix and/or
 broadcasting (it affects brdmain.c and ppfmain.c and accompanying
 aibrd.h and aippf.h)

---------------------------------------------------------------------------

1. The supplied code has been tested under both BSPlib and LamMPI.
   It compiles under both environments and executes without any
   problems.

2. BSPlib or LamMPI set up
 
   2a. Edit file ai.h  and make sure
       #define BSPLIB
    and
       #undef MPILIB  (default)
    i.e. one is defined and the other is NOT.

   2b. Edit file brdai.h or ppfai.h
     and choose either
      typedef int data;  
     or something else (eg.
      typedef float data; (default)
   2c. Depending on choice in 2b edit one of
       brdmain.c or  ppfmain.c and do the following
      edit operations if defaults need to be changed
      i.
       #define SUM sum4 
      or
       #define SUM sum3
       This decides whether in broadcasting or parallel
       prefix test routines source and destination are the same
       or not.
       By default: Source is sum3
       By default: Destination is SUM (whatever this is).
       NOTE: Some library function of MPIlib and/or BSPlib may not
       behave as expected.

      ii. 
       #define INT_TYPE
      and undef the other
       #define FLOAT_TYPE
      This is for printing debugging purposes. Choose one or the other
      based on the choices of step 2b.

      iii. If you want to see that the functions work as
      advertised 
        #define BRD_DEBUG

      A primitive form of checking is also performed by default;
      an error will be reported any way if the values computed are not
      the ones expected.
     
3. Edit Makefile. THe Makefile copies an executable to all processors
   as defined in ~/.bsptcphosts for BSPlib and  the boot host file
   (the one used for lamboot -v         MPI booting) in LAMMPI. In my
   implementations ccp is a csh script that uses rcp/scp to perform
   this "broadcast"/"copy" operation of the executable file.
   If you DON"T have something similar available, then delete all
   lines of the form
    % ccp XXX YYY
   where XXX is an executable and
         YYY is the destination directory 	~/YYY.
4. Ready to compile
   a. Compile brd under MPI # broadcasting functions
      % make clean;make mpibrd
   b. Compile brd under BSP
      % make clean;make bspbrd

   c. Compile ppf under MPI # parallel prefix/scan functions
      % make clean;make mpippf
   d. Compile ppf under BSP
      % make clean;make bspppf


5. Run
   % mpirun -np 16 mppf 16 15 14
   % bsprun -noload -local -npes 16 ppf 16 15 14

   where 16 (first arg) is degree of replication in tree based
   operations, 15 (second arg) is size of prefix array in data
   elements (total size 15*sizeof(data) in chars) and 14 number of
   runs (average and total time reported).


  SImilarly

   % mpirun -np 16 mbrd 16 15 14 13
   % bsprun -noload -local -npes 16 brd 16 15 14 13

   where 16 (first arg) is degree of replication in tree based
   operations, 15 (second arg) is size of  array to be broadcast,
   14 (third  arg) is processor index (0..MAXPROCS) that
   initiates the broadcast and
   13 number of runs (average and total time reported).

   MAXPROCS is one smaller than the argument to -np or -npes
6. Algorithms implemented

   a. Broadcasting

    BSPLibrary is Jonathan Hill's bsp_bcast native level1 BSPlib
    primitive operation.
    MPILIbrary is MPI_Bcast
    2-phaseBRD is the two phase algorithms which is optimal for
    cases where n (second argument to ?brd) is much larger than
    MAXPROCS (argument to -np or -npes)

    Tree BRD is a d-ary replication based broadcasting algorithm that
    extends the corresponding EREW PRAM algorithm that
    involves binary replication.

   b. Parallel Prefix
    BSPlibrary is Jonathan Hill's bsp_prefix implementation
    MPIlibrary is MPI_Scan.
    2-phasePPF is similar to 2-phase BRD for prefix.
    Ladner is the tree based Ladner-Fisher algorithm generalized
    from a binary tree (see F. T. Leighton Parallel Algorithm book)
    to any d-ary tree implementation.
    ButterflyPPF is the generalization to a d-ary setting of the
     butterfly algorithm for parallel prefix. Note that in this latter
     case the operator MUST be commutative NOT JUST ASSOCIATIVE.

   In either case (broadcasting parallel prefix) DecideBest was
   supposed to choose one or the other of 2-phase and tree-based
   algorithms. CUrrently this feature is disabled; if you wish to
   enable it comment out line in brdmain.c of ppfmain.c as
   desired.

   To make sense out of this feature you must supply appropriate values
   for the BSP parameters of your host machine. The default ones
   for MPI are in aimpi.c and probably are not very well suited. Edit
   this file if you have a better estimate. BSPlib values are automatically
   read by a default file. Do as instructed in the BSPlib manuals.

   in addition some cutoff values for broadcasting and parallel prefix
   are define in brdai.h and ppfai.h

**********
Section B.
==========

_Note_

Some of the functions originate from functions that were originally 
written purely under BSPlib. Making such functions work under 
LAMMPI required some code rewriting (related to how communication is 
performed). The current name
of the function that works under both BSPlib and LAMMPI is the ai* name;
the original name starts with bsp*. Note: Just because a function is called
bsp* this doesnot mean is identical to an older function of the same name
that one may have obtained separately from my current or previous Web-page.
The name may be the same but the semantics may not be.

ai2mbcast
bsp_2_mbroadcast : Broadcasts a multiple message  from any
		   processor to the remaining processors in two stages		
		   The algorithm _may_ suffer from loss of performance
		   when message-size/number of processors  is not a 
		   multiple of two, in various 64-bit machines, due 
		   to a slowdown of the native memcpy function when 
		   blocks of non-aligned memory are transferred.
	NOTICE: The two-phase variant suffers from memory-alignment problems
		on at least the SGI Power Challenge(this has to do with the
		SGI implementation of memcpy, rather than the code).Test it 
                with message sizes of length, 4k, 4k+1 and 4k+2 integers, 
	        for any k, to verify or not that such a problem exists on 
		your machine. 

ai2ombcast
bsp_F_mbroadcast: A variation of the previous one. 

aitmbcast
bsp_mbroadcast: Builds a tree of varying degree to perform the broadcasting.
		One should choose this one for small message broadcasts. For
		large ones it's better to use one of the previous functions. 
		Degree can by any integer >1.
aibcast
bsp_broadcast:  Use this one for 1-message broadcasts (of any data type),
		for more robustness .
aimbcast
bsp_multibroadcast: Let the machine choose the best broadcasting algorithm.
		    It guesses fairly well on the SGI. Needs fine tuning for
		    other machines. Modify *_CUTOFF parameters in mybsp.h,
		    for your particular architecture.

ai2mprefix
bsp_2_mprefix:	Same comments as in bsp_2_mbroadcast apply.
aitmprefix
bsp_mprefix  :	Ladner and Fisher's parallel prefix algorithm on a tree
		of varying degree (regular is assumed). It consists of
		2*depth phases.
aitmmprefix
bsp_mmprefix :	A variation of a parallel prefix algorithm with depth phases
		but more communication. In practice it's about twice as
		fast as the previous one.
		(and codewise, 20% of its size in lines)
aiprefix
bsp_prefix   :	The equivalent of bsp_broadcast
aimprefix
bsp_multiprefix: The equivalent  of bsp_multibroadcast

	If you want to choose your own tree (degree) here are some hints:
  -- Use the BSP cost model to figure one what seems better for your
     machine by dividing L/g, and choosing degree so that 
     degree * message size is about L/g
  -- I am using brdmain.c to choose the appropriate parameters.

Good luck,
Alexandros Gerbessiotis

----
BUGS
----
my_estimate_nprocs was double. Fixed to int. Thanks to F.Petrini for
                               reporting it.


**********
Section C
---------

a.  File dependencies
    brdai.h  ---> brd*.c
    ppfai.h  ---> ppf*.c

    ai.h     ---> brdai.h
    ai.h     ---> ppfai.h

    ai.h     ---> aimisc.h
    aimisc.h ---> aimisc.c
    aimisc.h ---> brdmain.c
    aimisc.h ---> ppfmain.c

    ai.h     ---> aimpi.c
    aimpi.h  ---> aimpi.c

b. What are the various files

   ai stands for architecture independent
   ai.h  : Defines ai primitives under both BSPlib abd MPIlib for
           direct remote memory access communication. The syntax is that
           of BSPlib for put and get.
   aimisc.* : Functions needed by brdmain.c ppfmain.c
   aimpi.*  : Functions coupled with ai.h to deliver BSPlib compatibility
              from MPI.
   brdaia.* : Broadcasting functions.
   ppfaia.* : Parallel PreFix function
   brdmain.c: Test program for brdaia*
   pffmain.c: Test program for ppfai*
   Makefile : Makefile
   *.txt    : Various text files with informations (brdppf.txt is THIS file)
   ppfai.h
   brdai.h  : Heade files for brd* ppf*

c. How can one create his own program that is runable under both BSPlib
   and LAMMPI.

   1. Copy ai.h aimpi.c aimpi.h in a local directory

   2. A strong suggestion to copy brdmain.c amd aimisc* files
      to create your own test program that includes the main function.

      aimisc.c has functions for
      a. broadcasting comman-line values from processor 0 to every other
         processor.
      b. To find the minimum and maximum time if you perform timing
         experiments overall processors.

      *main.c
      c. Gives a skeleton on how to start/terminate a program and
         deal with comman line arguments.

   3. Copy and edit the Makefile.

   You are done.
*********
Section D
---------

1. The original BSP code suffers when being compiled under MPI from
   the following problems.
   a. Put operation under MPI do not seem to allow src and dest to
      be the same address space, whereas bsp_puts do. THis causes
      problems as buffering is required.
   b. In tree-based prefix operations extra WIN_FENCE operations are
      required (aka AISYNC) for the code to work properly.
      This siginficantly delays performance
2. Performance related  questions are addressed briefly in
   ppf.txt  brd.txt where timings of the operations on a 16-processor
   cluster (a CISCO 24-port 2924 Catalyst Switch was used for
   connectivity purposes) are reported.


3. BSPlib implementation of bsp_bcast
   For an array of size n on p procs bsp_bcast implements a 2-phase
   algorithm by having the source processor, call it 0, send every other
   processor  0..p-2 a block of size floor(n/p) and to processor
   p-1 a block of size n- (p-1) * floor(n/p). For large p however,
   processor imbalanca can be siginficant.
   Our implementation makes sure that any processor get  either
   floor(n/p) or ceiling(n/p) 


Section E.
**********
From my experience the following options seem to work quite well as long
as one does not broadcast large message sizes or ppf large arrays.

BSPLINK = bspcc -O3 -flibrary-level 2  -bspnoslots 4  \
-bspslotsize  40 -bsproundtrip 200 -bspsendlatency 25

The round trip delay on my cluster (CISCO 100mbit port switch) round
trip is about 200-250; this could be as low as 180 but no less. The
send latency is the absolute minimum at 25; setting it lower may
cause kernel/network card buffer overflows panics  and Linux crashes.

If one runs bppf 2 128000 1 , then the 2-phase broadcast suffers
from performance degradation. If one observes the minimum-maximum reported
times there is a huge discrepancy compared to the other algorithms as well
This is to mean that the choice of the parameters needs to be adjusted
One good choice is the one below. Higher may be better. But the most
improvements seem to come from an increases in slot size (from 40 to
100) and a drop in discrepancy between min and max times.
#
BSPLINK = bspcc -O3 -flibrary-level 2  -bspnoslots 4  \
-bspslotsize  100 -bsproundtrip 200 -bspsendlatency 25
