AUGUST 14, 2020:  PARALLEL INTEGER SORTING
------------------------------------------
A collection of parallel integer sorting routines including
  (a) 32-bit (unsigned int) sorting using a 4-round count-sort (radix-sort)
  (b) 32-bit (unsigned int) sorting using a 2-round count-sort (radix-sort)
  (c) 32-bit (unsigned int) sorting using a 4-round count-sort (radix-sort)
      a different implementation
  (d) integer sorting using a modification of the generic sorting 
      BSP (Bulk Synchronous Parallel)
      randomized sorting algorithm (Gerbessiotis-Valiant)
      The modification involves essentially an integer sort of data based
      on destination processor id that allows for coarse-grained communication
      rather than an inefficient fine-grained communication
  (e) integer sorting using a generic sorting BSP (Bulk Synchronous Parallel)
      deterministic sorting algorithm (Gerbessiotis-Siniolakis)
  (f) integer sorting using a generic sorting BSP (Bulk Synchronous Parallel)
      randomized    sorting algorithm (Gerbessiotis) utilizing the structure
      of the deterministic sorting algorithm (Gerbessiotis-Siniolakis)
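
As an illustration of items (a) and (c), the following is a minimal
sequential sketch of a 4-round count-sort (radix-sort) on 32-bit unsigned
keys, assuming 8 bits per round. The function name radix4_sort_u32 is
hypothetical and is not the library's own routine; the library's versions
may differ in radix, buffering and of course parallelization.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the library): LSD radix sort of n
 * 32-bit unsigned keys in 4 rounds of 8 bits each. Returns 0 on
 * success, -1 if the scratch buffer cannot be allocated. */
int radix4_sort_u32(uint32_t *keys, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL && n > 0)
        return -1;

    uint32_t *src = keys, *dst = tmp;
    for (int round = 0; round < 4; round++) {
        size_t count[256] = {0};
        int shift = 8 * round;

        /* Count how many keys fall into each of the 256 buckets. */
        for (size_t i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xFFu]++;

        /* Exclusive prefix sum: count[b] becomes the first output
         * slot of bucket b. */
        size_t sum = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = sum;
            sum += c;
        }

        /* Stable scatter of the keys into the destination buffer. */
        for (size_t i = 0; i < n; i++)
            dst[count[(src[i] >> shift) & 0xFFu]++] = src[i];

        /* The output of this round is the input of the next one. */
        uint32_t *swap = src;
        src = dst;
        dst = swap;
    }

    /* After an even number of rounds the sorted data is back in keys. */
    free(tmp);
    return 0;
}
```

A 2-round variant as in item (b) would, under the same assumptions, use
16 bits per round and a 65536-entry count array.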

There are four makefiles
  Makefile.pisrt is just for backup
  Makefile.bsl is for BSPlib       compilation
  Makefile.bsp is for MulticoreBSP compilation (version 1.2 only)
  Makefile.mpi is for MPI (MPI-2 primitive support required) compilation
  (Lam-MPI or open-MPI, the latter being used and tested)

  After the compilation instructions, information is provided on installing
  your own version of BSPlib, MulticoreBSP and openMPI, if needed.

Before compilation starts
  Edit srtpari/avgaii.h  and decide
  whether one of BSPlib (BSPLIB defined) or 
           MulticoreBSP (BSPLIB and MCORE defined) 
  will be used
  or MPI (undef BSPLIB and MCORE, and define MPILIB).

Thus in the first step you need to decide whether MPI is to be used 
(MPILIB defined and, at the same time, BSPLIB undefined) or not.
If not, the state would be as shown below: BSPLIB is defined, and
the choice of MulticoreBSP vs BSPlib is determined by whether the
MCORE flag is defined.

*******************************************
STEP 1: BSPlib or MulticoreBSP (v1.2) 
*******************************************
Edit avgaii.h and make sure that the ordering is as follows
for MulticoreBSP usage
#undef  BSPLIB
#define BSPLIB
#undef  MCORE
#define MCORE  /*swap def/under MCORE lines :MultiCoreBSP vs BSPlib*/
#define MPILIB
#undef  MPILIB /*swap def/under MPILIB lines to active MPI         */

For  BSPlib usage, do the following
#undef  BSPLIB
#define BSPLIB
#define MCORE  /*swap def/under MCORE lines :MultiCoreBSP vs BSPlib*/
#undef  MCORE
#define MPILIB
#undef  MPILIB /*swap def/under MPILIB lines to active MPI         */


Summary : #define BSPLIB for either library
             #define MCORE  for MulticoreBSP 
             #undef  MCORE  for BSPlib 
          #undef MPILIB  must also be the last one or expect trouble!
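
The effect of the ordering can be sketched as follows. The flag lines
mirror the MulticoreBSP ordering shown above (the last #define/#undef of
each flag is the one that takes effect). The sanity check and the
backend_name helper are hypothetical additions for illustration only and
are not part of avgaii.h.

```c
/* Flag ordering as described for MulticoreBSP usage: for each flag,
 * the last of its #define/#undef lines wins. */
#undef  BSPLIB
#define BSPLIB
#undef  MCORE
#define MCORE
#define MPILIB
#undef  MPILIB

/* Hypothetical sanity check (not in the library): MPI and BSPlib
 * must not be selected at the same time. */
#if defined(MPILIB) && defined(BSPLIB)
#error "MPILIB and BSPLIB must not both be defined"
#endif

/* Hypothetical helper (not in the library): names the backend that
 * the flags above select. */
const char *backend_name(void)
{
#if defined(MPILIB)
    return "MPI";
#elif defined(BSPLIB) && defined(MCORE)
    return "MulticoreBSP";
#elif defined(BSPLIB)
    return "BSPlib";
#else
    return "none";
#endif
}
```

Swapping the two MCORE lines in this sketch would make backend_name
return "BSPlib" instead.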


*******************************************
STEP 2: Compile code and link with  BSPlib by running
        make clean -f Makefilei.bsl ; make all -f Makefilei.bsl 

        Compile code with  MulticoreBSP 1.2  by running
        make clean -f Makefilei.bsp ; make all -f Makefilei.bsp 

        You may replace clean with erase 
*******************************************


-----------------------------------------------
 2.a INSTALLATION of BSPlib (and a fix)
-----------------------------------------------
Note that BSPlib installation   uses tcsh! 
Note that BSPlib execution also uses tcsh!
So it is imperative that you do all relevant steps in a tcsh window.

I am still using BSPlib version 1.4. 
However its installation can fail (under shmem) for the following reason: 
A Makefile needs a minor adjustment. In my case I edit
 BSP/src/library_core/Makefile

so that line 419 includes -Di386 under the LINUX) option
Makefile 
419c419
<                                   -DUNDERSCORE -ansi -DLINUX             \
---
>                                   -DUNDERSCORE -ansi -DLINUX -Di386      \

The rest of the installation is straightforward
% gzip -d v1.4_bsplib_toolset.tar.gz
% tar xvf v1.4_bsplib_toolset.tar
% cd BSP
% ./configure
  answer LINUX,  SHMEM_SYSV and 
  for number of procs whatever you can afford or is appropriate (eg 16).
  Let it work out the configuration otherwise.
% # now edit the Makefile as shown above

% make
% make install

 you might need to put BSP/bin in your path eg  
set path=($path /home/XXXXX/BSP/bin)
 or 
set path=(/home/XXXXX/BSP/bin $path)
in .cshrc and you are ready to go
(Do a rehash and type bspcc to confirm that everything is ok path-wise)

The BSPlib compilation I am using is pre (or around) gcc/4.8.2

-----------------------------
 2.b EXAMPLE: Typical BSPlib compilation with gcc 8.3.0 and also 4.8.2
-----------------------------
Note: The current code might use by default skylake optimizations.
If this does not work for you replace the occurrence of skylake
with native in the corresponding Makefile. The examples below
use native.

MORE RECENT COMPILATION OUTPUT
% make clean -f Makefile.bsl ; make all -f Makefile.bsl
rm -rf aimisc.o prdxa.o prdxb.o i32rdx4.o btn.o mbspisrt.o oets.o gsd.o gvr.o ger.o seqo.o  mpipisrt bspisrt
bspcc  -march=native -O3 -pedantic -c srtpari/aimisc.c  
bspcc  -march=native -O3 -pedantic -c srtpari/prdxa.c  
bspcc  -march=native -O3 -pedantic -c srtpari/prdxb.c  
bspcc  -march=native -O3 -pedantic -c srtpari/i32rdx4.c 
bspcc  -march=native -O3 -pedantic -c srtpari/btn.c 
bspcc  -march=native -O3 -pedantic -c srtpari/mbspisrt.c 
bspcc  -march=native -O3 -pedantic -c srtpari/oets.c 
bspcc  -march=native -O3 -pedantic -c srtpari/gsd.c 
bspcc  -march=native -O3 -pedantic -c srtpari/gvr.c 
bspcc  -march=native -O3 -pedantic -c srtpari/ger.c 
bspcc  -march=native -O3 -pedantic -c srtpari/seqo.c 
bspcc -I. aimisc.o prdxa.o prdxb.o i32rdx4.o btn.o mbspisrt.o oets.o gsd.o gvr.o ger.o seqo.o -o bspisrt  -flibrary-level 2 -bspnobuffers 2 -bspfifo 2000 -bspbuffer 15000 -fcombine-puts-buffer 2000 -L. 
Use of assignment to $[ is deprecated at /afs/cad/u/a/l/alexg/BSP/include/ctime.pl line 27.

-----------------------------
2.c Typical BSPlib execution  :
-----------------------------

Note: observe the name of the executable file generated. It might be
different from the one indicated here. It might also be different
(due to my sloppiness) in a Makefile.

% bsprun -npes 4  ./bspisrt 4 1000000 1 3 0
  4      : # of processors/processes/cores 
 1000000 : number of keys per processor
  1      : type of input (aka random ints)
           if 1  : random ints
           if 2  : some random int variation
           if 3  : sorted sequence f(i)=i
           if 4  : 17
           else  : reverse sorted sequence f(i)=n-i;

  3      : number of runs (average is given as runtime)
  0      : debug option (1 is the alternative)
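
The input types can be sketched as below. This is a hypothetical
illustration, not the library's generator: fill_input is an invented
name, rand() stands in for whatever random source the code uses, and n
is assumed to be the length of the local array. Types 2 and 4 are not
described precisely enough here to sketch, so the function declines them.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the library): fills a[0..n-1]
 * according to the input type. Returns 0 on success, -1 for the
 * types (2 and 4) that this sketch does not cover. */
int fill_input(uint32_t *a, size_t n, int type)
{
    if (type == 2 || type == 4)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (type == 1)
            a[i] = (uint32_t)rand();   /* assumption: any PRNG will do */
        else if (type == 3)
            a[i] = (uint32_t)i;        /* sorted:  f(i) = i     */
        else
            a[i] = (uint32_t)(n - i);  /* reverse: f(i) = n - i */
    }
    return 0;
}
```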

 OK in the output means that the output was collected to one processor
 and was compared to the output of the sequential method used for the first
 run (i32rdx4 or qsort). If they are equal OK is printed else the first error is
 output (keys differing and index position in the sequential/serial output).
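
 The check described above can be sketched as follows. The names
 first_mismatch and report are hypothetical, and the wording of the
 error line is an assumption; only the OK token is taken from the
 actual output.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from the library): returns the index of
 * the first position where the collected parallel output differs
 * from the sequential reference, or n if the two arrays agree. */
size_t first_mismatch(const uint32_t *par, const uint32_t *seq, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (par[i] != seq[i])
            return i;
    return n;
}

/* Prints OK, or the differing keys and their index position in the
 * sequential output. The error message format is an assumption. */
void report(const uint32_t *par, const uint32_t *seq, size_t n)
{
    size_t i = first_mismatch(par, seq, n);
    if (i == n)
        printf("OK\n");
    else
        printf("ERROR at index %lu: parallel=%lu sequential=%lu\n",
               (unsigned long)i, (unsigned long)par[i],
               (unsigned long)seq[i]);
}
```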

LIMITATIONS : Number of keys = number of processors/processes * ints per processor.

[RECENT RUN]
#define PARALLELSAMPLESORT   /* default */
#undef  PARALLELSAMPLESORT
#undef  ROUTEONE
#define ROUTEONE             /* default */
#undef  KWAYMERGE
#define KWAYMERGE            /* default */


% bsprun -npes 4 ./bspisrt 4 1000000 1 4 0
pid=0 of 4 n=1000000 type=1 runs =4 debug=0   // Run 2018 or 2019
(4000000, 1,4):                 i32r4 generic: Elapsed time is: 0.02748100 OK
(4000000, 1,4):                  rdx4 generic: Elapsed time is: 0.01607675 OK
(4000000, 1,4):                  pprd generic: Elapsed time is: 0.01578150 OK
(4000000, 1,4):                  rdx2 generic: Elapsed time is: 0.12760600 OK
(4000000, 1,4):                  btns generic: Elapsed time is: 0.02676150 OK
(4000000, 1,4):                  oets generic: Elapsed time is: 0.02566375 OK
(4000000, 1,4):                  gsd  generic: Elapsed time is: 0.01906425 OK

% bsprun -npes 4 ./bspisrt 4 1000000 1 4 0  // Run 2020 with additional algs
					    // Skylake and -O3
(4000000, 1,4):                 i32r4        : Elapsed time is: 0.02772500 OK
(4000000, 1,4):                  rdx4        : Elapsed time is: 0.01750625 OK
(4000000, 1,4):                  pprd        : Elapsed time is: 0.01741075 OK
(4000000, 1,4):                  rdx2        : Elapsed time is: 0.16343900 OK
(4000000, 1,4):                  btns        : Elapsed time is: 0.02648550 OK
(4000000, 1,4):                  oets        : Elapsed time is: 0.02575300 OK
(4000000, 1,4):                  gsd         : Elapsed time is: 0.01930675 OK
(4000000, 1,4):                  gvr         : Elapsed time is: 0.02228600 OK
(4000000, 1,4):                  ger         : Elapsed time is: 0.01920500 OK
% bsprun -npes 4 ./bspisrt 4 1000000 1 4 0  // Run 2020 with additional algs
					    // native  and -O2
(4000000, 1,4):                 i32r4        : Elapsed time is: 0.02828700 OK
(4000000, 1,4):                  rdx4        : Elapsed time is: 0.01894600 OK
(4000000, 1,4):                  pprd        : Elapsed time is: 0.01867625 OK
(4000000, 1,4):                  rdx2        : Elapsed time is: 0.16671900 OK
(4000000, 1,4):                  btns        : Elapsed time is: 0.02646075 OK
(4000000, 1,4):                  oets        : Elapsed time is: 0.02584725 OK
(4000000, 1,4):                  gsd         : Elapsed time is: 0.01942125 OK
(4000000, 1,4):                  gvr         : Elapsed time is: 0.02262725 OK
(4000000, 1,4):                  ger         : Elapsed time is: 0.01918225 OK

-----------------------------------------------
 2.d INSTALLATION of Multicore-BSP 
-----------------------------------------------

 For compatibility use 1.2 version.

%   tar xvfz MulticoreBSP-for-C.tar.gz
%   cd MulticoreBSP-for-C;
%   make
%   make tests
%   make compat
     #  make compat works with version 1.2
     # make compat does not work with version 2.0.3.

-----------------------------
2.e  EXAMPLE: Typical MulticoreBSP compilation with gcc 8.3.0 or 4.8.2
-----------------------------
  Note that avgaii.h references ../../mcbsp.h . Make sure your .h is reachable there
% make clean -f Makefilei.bsl    # remove the BSPlib exe files
% # edit srtpari/avgaii.h to reflect the intent of compiling MulticoreBSP code

  Note that I am using version 1.2 of MulticoreBSP (as -DMCBSP_COMPATIBILITY_MODE
  works with it only)

% make clean -f Makefilei.bsp ;make all -f Makefilei.bsp
rm -f *.o *~ core* bspisrt mpipisrt 
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o aimisc.o srtpari/aimisc.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o btn.o srtpari/btn.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o prdxa.o srtpari/prdxa.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o prdxb.o srtpari/prdxb.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o i32rdx4.o srtpari/i32rdx4.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o seqo.o srtpari/seqo.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o mbspisrt.o srtpari/mbspisrt.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o oets.o srtpari/oets.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o gsd.o srtpari/gsd.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o gvr.o srtpari/gvr.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I .   -c -o ger.o srtpari/ger.c
gcc -O3 -march=skylake -DMCBSP_COMPATIBILITY_MODE -Wall -pedantic -I . -o bspisrt aimisc.o btn.o prdxa.o prdxb.o i32rdx4.o seqo.o mbspisrt.o oets.o gsd.o gvr.o ger.o ../compat-libmcbsp1.2.0.a -pthread -lm -lrt

-----------------------------
2.f  Typical MulticoreBSP execution:
-----------------------------
To execute
 % ./bspisrt 4 1000000 1 4 0

cat machine.info
threads_per_core 2
thread_numbering wrapped

% ./bspisrt 4 1000000 1 4 0   
                               // 2020 run
(4000000, 1,4):                 i32r4        : Elapsed time is: 0.02712340 OK
(4000000, 1,4):                  rdx4        : Elapsed time is: 0.02274517 OK
(4000000, 1,4):                  pprd        : Elapsed time is: 0.02209138 OK
(4000000, 1,4):                  rdx2        : Elapsed time is: 0.04116896 OK
(4000000, 1,4):                  btns        : Elapsed time is: 0.02924326 OK
(4000000, 1,4):                  oets        : Elapsed time is: 0.02726086 OK
(4000000, 1,4):                  gsd         : Elapsed time is: 0.01718986 OK
(4000000, 1,4):                  gvr         : Elapsed time is: 0.01648456 OK
(4000000, 1,4):                  ger         : Elapsed time is: 0.01712897 OK



% ./bspisrt 4 1000000 1 4 0
pid=0 of 4 n=1000000 type=1 runs =4 debug=0 // 2018 or 2019 run
(4000000, 1,4):                 i32r4 generic: Elapsed time is: 0.02597171 OK
(4000000, 1,4):                  rdx4 generic: Elapsed time is: 0.02249856 OK
(4000000, 1,4):                  pprd generic: Elapsed time is: 0.02192516 OK
(4000000, 1,4):                  rdx2 generic: Elapsed time is: 0.04220291 OK
(4000000, 1,4):                  btns generic: Elapsed time is: 0.02987830 OK
(4000000, 1,4):                  oets generic: Elapsed time is: 0.02852352 OK
(4000000, 1,4):                  gsd  generic: Elapsed time is: 0.02153844 OK

cat machine.info
threads_per_core 2
thread_numbering consecutive


*******************************************
STEP 3: Compile code and link with  openMPI
        make clean -f Makefilei.mpi ; make all -f Makefilei.mpi 

*******************************************

-----------------------------------------------
 3.a INSTALLATION of openMPI
-----------------------------------------------
 I have variously installed openMPI 2.1.1 and 1.8.1, plus MPICH 3.2.1

% cd openmpi-1.8.1
% ./configure --prefix=/your-local-place/openmpi-1.8.1
% make all   
  or 
% make -j 4  all   # parallel build (4 jobs); this affects compilation and
                   # installation only, not execution or runtime
% make install

and of course, in .bash_profile
export LD_LIBRARY_PATH=/afs/cad/u/a/l/alexg/openmpi-1.8.1/lib:$LD_LIBRARY_PATH
export PATH=/afs/cad.njit.edu/u/a/l/alexg/openmpi-1.8.1/bin/:$PATH

for tcsh
set path=(  / your path to /openmpi-2.1.1/bin  $path)
or
set path=(  / your path to /mpich/bin  $path)
and 
setenv LD_LIBRARY_PATH  / your path to     /openmpi-2.1.1/lib
setenv LD_LIBRARY_PATH  / your path to     /mpich/lib

-----------------------------
3.b Typical MPI    compilation with gcc 8.3.0 or 4.8.5
-----------------------------
% make clean -f Makefilei.bsp
  (to remove the prior run of say MulticoreBSP)
% # edit srtpari/avgaii.h See below for the correct ordering
#define BSPLIB /*DEFAULT*/
#undef  BSPLIB
#undef  MCORE
#define MCORE  /*DEFAULT:swap with undef MCORE:MultiCoreBSP vs BSPlib*/
#undef  MPILIB /*DEFAULT:swap with define MPILIB: MPI + undef BSPLIB */
#define MPILIB

 Note: the executable is either mmpiisrt or mpipisrt
(Recent compilation) 
% make clean -f Makefilei.mpi ; make all -f Makefilei.mpi
rm -rf *.o *~ core*   avgmpi.o aimisc.o prdxa.o prdxb.o i32rdx4.o btn.o mmpiisrt.o oets.o gsd.o gvr.o ger.o seqo.o mmpiisrt bspisrt 
mpicc -O3 -march=native  -Wall    -c -o avgmpi.o srtpari/avgmpi.c
mpicc -O3 -march=native  -Wall    -c -o aimisc.o srtpari/aimisc.c
mpicc -O3 -march=native  -Wall    -c -o prdxa.o srtpari/prdxa.c
mpicc -O3 -march=native  -Wall    -c -o prdxb.o srtpari/prdxb.c
mpicc -O3 -march=native  -Wall    -c -o i32rdx4.o srtpari/i32rdx4.c
mpicc -O3 -march=native  -Wall    -c -o btn.o srtpari/btn.c
mpicc -O3 -march=native  -Wall    -c -o mmpiisrt.o srtpari/mmpiisrt.c
mpicc -O3 -march=native  -Wall    -c -o oets.o srtpari/oets.c
mpicc -O3 -march=native  -Wall    -c -o gsd.o srtpari/gsd.c
mpicc -O3 -march=native  -Wall    -c -o gvr.o srtpari/gvr.c
mpicc -O3 -march=native  -Wall    -c -o ger.o srtpari/ger.c
mpicc -O3 -march=native  -Wall    -c -o seqo.o srtpari/seqo.c
mpicc -O3 -march=native  -Wall  avgmpi.o aimisc.o prdxa.o prdxb.o i32rdx4.o btn.o mmpiisrt.o oets.o gsd.o gvr.o ger.o seqo.o -o mmpiisrt  -lm

(Old compilation)
% make clean -f Makefile.mpi  ; make all -f Makefile.mpi
rm -f *.o *~   bspisrt mpipisrt bspisrp
mpicc -Wall  -O2  -march=native     -c -o avgmpi.o avgmpi.c
mpicc -Wall  -O2  -march=native     -c -o aimisc.o aimisc.c
mpicc -Wall  -O2  -march=native     -c -o prdxa.o prdxa.c
mpicc -Wall  -O2  -march=native     -c -o prdxb.o prdxb.c
mpicc -Wall  -O2  -march=native     -c -o i32rdx4.o i32rdx4.c
mpicc -Wall  -O2  -march=native     -c -o btn.o btn.c
mpicc -Wall  -O2  -march=native     -c -o mpipisrt.o mpipisrt.c
mpicc -Wall  -O2  -march=native     -c -o oets.o oets.c
mpicc -Wall  -O2  -march=native   -o mpipisrt avgmpi.o aimisc.o prdxa.o prdxb.o i32rdx4.o btn.o mpipisrt.o oets.o -lm



-----------------------------
3.c Typical MPI          execution:
-----------------------------
To execute
 % mpirun -np 4 ./mpipisrt 1000000 1 4 0

% mpirun -np 4 ./mmpiisrt 1000000 1  4 0   // -O3 native
(4000000, 1,4):                 i32r4        : Elapsed time is: 0.02800703 OK
(4000000, 1,4):                  rdx4        : Elapsed time is: 0.02232820 OK
(4000000, 1,4):                  pprd        : Elapsed time is: 0.02135164 OK
(4000000, 1,4):                  rdx2        : Elapsed time is: 0.05249107 OK
(4000000, 1,4):                  btns        : Elapsed time is: 0.02654922 OK
(4000000, 1,4):                  oets        : Elapsed time is: 0.02545905 OK
(4000000, 1,4):                  gsd         : Elapsed time is: 0.01666933 OK
(4000000, 1,4):                  gvr         : Elapsed time is: 0.01593924 OK
(4000000, 1,4):                  ger         : Elapsed time is: 0.01639003 OK


% mpirun -np 4 ./mpipisrt 1000000 1  4 0 //-O3 skylake -ffast-math -funroll-loops 
(4000000, 1,4):                 i32r4        : Elapsed time is: 0.02663183 OK
(4000000, 1,4):                  rdx4        : Elapsed time is: 0.02212447 OK
(4000000, 1,4):                  pprd        : Elapsed time is: 0.02118772 OK
(4000000, 1,4):                  rdx2        : Elapsed time is: 0.05584049 OK
(4000000, 1,4):                  btns        : Elapsed time is: 0.02594316 OK
(4000000, 1,4):                  oets        : Elapsed time is: 0.02472907 OK
(4000000, 1,4):                  gsd         : Elapsed time is: 0.01952827 OK
(4000000, 1,4):                  gvr         : Elapsed time is: 0.02215505 OK
(4000000, 1,4):                  ger         : Elapsed time is: 0.01928681 OK



(OLD RUN) 4.8.5
% mpirun -np 4 ./mpipisrt 1000000 1 4 0
(4000000, 1,4):                 i32r4        : Elapsed time is: 0.03544255 OK
(4000000, 1,4):                  rdx4        : Elapsed time is: 0.02185536 OK
(4000000, 1,4):                  pprd        : Elapsed time is: 0.02090658 OK
(4000000, 1,4):                  rdx2        : Elapsed time is: 0.05773602 OK
(4000000, 1,4):                  btns        : Elapsed time is: 0.03132410 OK
(4000000, 1,4):                  oets        : Elapsed time is: 0.03092958 OK
(4000000, 1,4):                  gsd         : Elapsed time is: 0.02422978 OK

2018
-Ofast -march=native 
-O3    -march=native 
-Ofast -march=native  -flto
-O3    -march=native  -flto
-Ofast -march=native  -flto -funroll-loops
-O3    -march=native  -flto -funroll-loops
-Ofast -march=native  -funroll-loops
-O3    -march=native  -funroll-loops
-O3 -std=c99 -msse4 -mtune=native -march=native -funroll-loops --param max-unroll-times=4 -ffast-math
-fast -xSSE4.2 -ipo -no-prec-div -static -opt-prefetch -unroll-aggressive -m64


THE FOLLOWING ARE RUNS BEFORE JUNE 28, 2018 and correspond to a 
previous version of the library when gsd was not working properly and 
the wrapper functions were causing crashes!


***************
OP1:  gcc -Wall -pedantic -O3  -march=native 
      gcc -Wall -pedantic -O3  -march=native  -funroll-loops not much improvement

MulticoreBSP
   	i32r4	rdx4	pprd	rdx2	btn	oets
4x1M	.0269	.0225	.0220	.0411	.0298	.0270
4x10M	.2729	.2295	.2256	.3171	.2963	.2854
8x.5M	.0389	.0216	.0213	.0442	.0441	.0460
8x5M	.3849	.2334	.2248	.4356	.4417	.4580


***************
OP2:  gcc -Wall -pedantic -O3  -mtune=native -march=native 

MulticoreBSP
   	i32r4	rdx4	pprd	rdx2	btn	oets
4x1M	.0275	.0212	.0207	.0339	.0284	.0270
4x10M	.2713	.2293	.2256	.3172	.2958	.2849
8x.5M	.0391	.0215	.0213	.0443	.0442	.0460
8x5M	.3836	.2333	.2249	.4351	.4418	.4583

***************
OP3: gcc -Wall -pedantic -O3  -funroll-loops -march=native  -flto 

MulticoreBSP
   	i32r4	rdx4	pprd	rdx2	btn	oets
4x1M	.0260	.0210	.0205	.0339	.0282	.0267
4x10M	.2619	.2294	.2241	.3171	.2948	
8x.5M	.0336	.0213	.0210	.0451	.0440	.0459
8x5M	.3422	.2313	.2229	.4356	.4404	.4568


***************
OP4  gcc -Wall -pedantic -O2  -march=native 

MulticoreBSP
   	i32r4	rdx4	pprd	rdx2	btn	oets
4x1M	.0279	.0213	.0207	.0339	.0285	.0270
4x10M	.2744	.2294	.2257	.3186	.2961	.2854
8x.5M	.0408	.0215	.0213	.0444	.0442	.0461
8x5M	.4045	.2335	.2250	.4361	.4422	.4576


BSPlib 2018 compilation
   	i32r4	rdx4	pprd	rdx2	btn	oets   gsd
4x1M	.0280	.0177	.0172	.1368	.0277	.0256  .0431
4x10M	.2777	.1656	.1648	.3795	.2827	.2765  .4374
8x.5M	.0367	.0159	.0154	.0228	.0382	.0395  .0709
8x5M	.3747	.1610	.1585	.5651	.3917	.4098  .7162


openMPI 1.8.1 (mpirun -help) 2014 compilation
   	i32r4	rdx4	pprd	rdx2	btn	oets 
4x1M	.0278	.0221	.0213	.0462	.0268	.0254 
4x10M	.2745	.1966	.1964	.6976	.2809	.2687
8x.5M	.0476	.0296	.0283	.0702	.0396	.0418
8x5M	.4866	.2187	.2166	.6306	.4017	.4176


openMPI 1.8.1 (mpirun -help) 2018 compilation
   	i32r4	rdx4	pprd	rdx2	btn	oets 
4x1M	.0281	.0220	.0211	.0453	.0266	.0253 
4x10M	.2751	.1951	.1950	.6976	.2799	.2683
8x.5M	.0489	.0291	.0277	.0678	.0395	.0425
8x5M	.4937	.2181	.2155	.6344	.3967	.4130


openMPI 2.1.1 (mpirun -help) 2018 compilation
   	i32r4	rdx4	pprd	rdx2	btn	oets 
4x1M	.0280	.0218	.0210	.0571	.0310	.0324 
4x10M	.2760	.1535	.1530	.3301	.2729	.2718
8x.5M	.0368	.0226	.0222	.0652	.0370	.0389
8x5M	.3626	.1531	.1504	.4488	.3727	.3844


***************
OP1:  gcc -Wall -pedantic -O3  -march=native 
BSPlib 2018 compilation  
   	i32r4	rdx4	pprd	rdx2	btn	oets   gsd
4x1M	.0277	.0160	.0155	.1511	.0268	.0256  .0426
4x10M	.2782	.1646	.1648	.3792	.2823	.2765  .4377
8x.5M	.0368	.0159	.0155	.0221	.0381	.0396  .0715
8x5M	.3739	.1610	.1580	.5659	.3911	.4095  .7171

***************
OP5  gcc -Wall -pedantic -O2  -march=native --param max-unroll-times=4 -ffast-math -funroll-loops

MulticoreBSP (OP5 is slightly better than OP1)
   	i32r4	rdx4	pprd	rdx2	btn	oets
4x1M	.0266	.0211	.0205	.0341	.0282	.0268
4x10M	.2603	.2278	.2240	.3183	.2936	.2836
8x.5M	.0321	.0214	.0211	.0445	.0440	.0459
8x5M	.3279	.2315	.2231	.4350	.4411	.4574

Note: Aug 10, 2020 @16:10
  Reorganization and move of script files; Makefiles

Note: Jul  9, 2020 @12:46
Effective with this update, versions are identified by a capital English
letter at the end of the name, i.e. 20pisrtA.tar or 20pisrtAl.tar


Update of 18pisrt_v4 @
052f4c0aac7e8f777ce0b24b1d1b54f4  18pisrtV4.tar
 Bear in mind differences between gcc/8.2.0 and gcc/4.8.2 in terms
of performance (latter faster)
# gcc -march=native -Q --help=target |grep -- '-march='
# Default otherwise is gcc/4.8.2
# Use correct vpath for sorting  (gsort) or merging (gmerge)
# module load gcc/9.1.0  use skylake for -march

