MATH 707: Applications of Parallel Computing

Homework 2: Parallelizing a Particle Simulation

 

Summary

    The task is to parallelize a toy particle simulator that reproduces the behaviour shown in the animation below. At each time step, every particle is checked for nearness to other particles (nearness being defined by a cutoff distance), and the particles are moved accordingly to avoid collisions.

The provided source code contains serial, OpenMP, and MPI implementations. This report describes the algorithms used in each version to improve efficiency.

 

Fig 1: Animation of particle interactions in the simulation

 

Compiler Optimization

    Turning on optimization flags makes the compiler attempt to improve performance at the expense of compilation time and, in some cases, the ability to debug the program.

The -Ofast compiler option enables all of the -O3 optimizations, plus aggressive options (such as fast math) that can break strict standards compliance.

 

Fig 2: Compiler optimization flags in the Makefile

 

Serial optimization

  In the original serial implementation, each particle is checked against every other particle to find those within the cutoff distance, and forces are applied to move the particles at each time step, giving O(n^2) work. We can reduce the overall complexity of the simulation to O(n) if we only consider neighbouring particles.

This is achieved by separating the particles into bins based on their coordinate locations in the grid. During the analysis of each particle, only the particles present in the local neighbourhood (the particle's own bin plus the adjacent bins above, below, to the sides, and on the diagonals) are investigated for probable collisions.

 

Fig 3: Representation of binning in local neighbourhood around selected particle

 

   As shown in Fig 3, the local neighbourhood contains at most 9 bins (the 3x3 block centred on the particle's bin; bins on the edge of the grid have fewer neighbours), which significantly reduces the running time compared to the original model. For this to be correct, the dimensions of each bin must be defined as a function of the cutoff value, so that any interacting pair of particles is guaranteed to lie in neighbouring bins. Each bin is a vector of particles (bin_type), and all particles are assigned to bins at the beginning, as shown in Fig 4.
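As a rough illustration, the initial binning pass of Fig 4 might look like the sketch below. The particle_t layout, the cutoff constant, and the helper names are assumptions modelled on the assignment's starter code, not the exact submitted source.

    #include <vector>

    // Assumed particle layout, following the starter code's particle_t.
    struct particle_t { double x, y, vx, vy, ax, ay; };

    typedef std::vector<particle_t*> bin_type;

    const double cutoff = 0.01;   // assumed interaction cutoff from the starter code
    double bin_size;              // bin side length, tied to the cutoff
    int bin_count;                // number of bins per row/column
    std::vector<bin_type> bins;   // bin_count x bin_count bins, stored row-major

    // Assign every particle to the bin covering its (x, y) coordinates. With
    // bin_size >= cutoff, any interacting pair must lie in neighbouring bins.
    void build_bins(std::vector<particle_t>& particles, double grid_size) {
        bin_size  = cutoff;
        bin_count = (int)(grid_size / bin_size) + 1;
        bins.assign(bin_count * bin_count, bin_type());
        for (particle_t& p : particles) {
            int bx = (int)(p.x / bin_size);
            int by = (int)(p.y / bin_size);
            bins[by * bin_count + bx].push_back(&p);
        }
    }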

 

Fig 4: Initial binning algorithm assigning particles into bins based on location

 

Fig 5: Forces are applied on the particles in the bins lying in the local neighbourhood

 

   The bins populated in Fig 4 are then traversed during the force computation: for each particle, the algorithm in Fig 5 locates the bins of the local neighbourhood and applies forces only to the particles found in them. Simulation times are measured in seconds and shown in Fig 6.
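The neighbourhood traversal of Fig 5 can be sketched as follows, building on the binning sketch above; apply_force is assumed to be the starter code's pairwise force routine and is only declared here.

    // Assumed from the starter code: applies the repulsive force of q on p.
    void apply_force(particle_t& p, const particle_t& q);

    // Compute forces for every particle in bin (bx, by) by scanning only the
    // 3x3 block of neighbouring bins; edge bins simply have fewer neighbours.
    void apply_forces_in_neighbourhood(int bx, int by) {
        for (particle_t* p : bins[by * bin_count + bx]) {
            p->ax = p->ay = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = bx + dx, ny = by + dy;
                    if (nx < 0 || nx >= bin_count || ny < 0 || ny >= bin_count)
                        continue;
                    for (particle_t* q : bins[ny * bin_count + nx])
                        if (p != q)
                            apply_force(*p, *q);
                }
            }
        }
    }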

 

Fig 6: Output obtained on running auto-bridges-serial

 

 

Fig 7: Plot of serial algorithm simulation time against number of particles

 

Shared Memory Algorithm, OpenMP

  Next, using the same algorithm, OpenMP was implemented to introduce parallelization into the particle simulation. Particle movements are tracked in an additional vector, and the results from all threads are collected at the end of each step and accumulated synchronously.

The #pragma omp master directive is used so that a single thread recalculates the bins for each particle that moved in the previous time step, and the #pragma omp barrier directive ensures that the program waits for the calculations in all threads to complete within a time step.
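A minimal sketch of one OpenMP time step under these directives is given below; move() is assumed to be the starter code's position-update routine, and build_bins() and apply_forces_in_neighbourhood() are the sketches from the serial section.

    #include <omp.h>

    void move(particle_t& p);   // assumed from the starter code

    void simulate_step_omp(std::vector<particle_t>& particles, double grid_size) {
        #pragma omp parallel
        {
            // Threads split the bins and compute forces from local neighbourhoods.
            #pragma omp for collapse(2)
            for (int by = 0; by < bin_count; by++)
                for (int bx = 0; bx < bin_count; bx++)
                    apply_forces_in_neighbourhood(bx, by);
            // (implicit barrier: every force is computed before any move)

            #pragma omp for
            for (int i = 0; i < (int)particles.size(); i++)
                move(particles[i]);
            // (implicit barrier: every particle has moved)

            // A single thread recalculates the bins for the moved particles...
            #pragma omp master
            build_bins(particles, grid_size);
            // ...and the explicit barrier holds the other threads until it is done.
            #pragma omp barrier
        }
    }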

Graphical representations of the runtime results for both strong and weak scaling simulations are shown in the figures below. The plots are consistent with the expected behaviour for both implementations, since the algorithm used reduces the computational complexity from O(n^2) to O(n).

 

 

Fig 8: Parallelized algorithm using #pragma omp for to compute force for particles

 

Fig 9: Parallelized algorithm that moves the particles and rebins them after each time step

 

Fig 10: Output obtained from running auto-bridges-openmp16

 

Fig 11: Plot of shared-memory algorithm (OpenMP) strong scaling simulation time against number of threads

 

Fig 12: Plot of shared-memory algorithm (OpenMP) weak scaling simulation time against number of threads

  

Distributed Memory Algorithm, MPI

 We implement the following distributed memory algorithm using MPI functions in C++; a sketch of the gather/scatter exchange appears after the list.

  • Broadcast all particles to all processors, incurring a large one-time communication overhead up front to save communication later
  • All processors place the particles into bins
  • All processors calculate start and end row indices, defining the rows of bins whose particles each processor will simulate
  • Simulate for N timesteps:
    • All processors compute forces for all particles in all bins between their start and end row indices
    • All processors work out intra- and inter-processor particle movement (with respect to the rows of bins each processor simulates)
    • All processors perform intra-processor moves
    • The root processor gathers all information about particles moved between processors and scatters it back to all processors
    • All processors receive their inbound inter-processor particles from the root processor and re-bin them
    • All processors now hold the new local state of the particles and can continue with the next simulation step
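A hedged sketch of the gather/scatter exchange from the last few steps follows; PARTICLE is assumed to be an MPI datatype built over particle_t's doubles, and owner_of() is a hypothetical helper mapping a particle's bin row to the rank that owns it.

    #include <mpi.h>
    #include <vector>

    int owner_of(const particle_t& p);   // hypothetical: rank owning p's bin row

    // Root gathers every rank's outbound particles, re-buckets them by
    // destination rank, and scatters them back; each rank then re-bins its
    // inbound particles locally.
    void exchange_moved_particles(std::vector<particle_t>& outbound,
                                  std::vector<particle_t>& inbound,
                                  MPI_Datatype PARTICLE, int rank, int nprocs) {
        // Root learns how many outbound particles each rank has.
        int out_n = (int)outbound.size();
        std::vector<int> counts(nprocs), displs(nprocs, 0);
        MPI_Gather(&out_n, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

        // Root gathers all moved particles into one staging buffer.
        std::vector<particle_t> all;
        if (rank == 0) {
            int total = 0;
            for (int r = 0; r < nprocs; r++) { displs[r] = total; total += counts[r]; }
            all.resize(total);
        }
        MPI_Gatherv(outbound.data(), out_n, PARTICLE, all.data(),
                    counts.data(), displs.data(), PARTICLE, 0, MPI_COMM_WORLD);

        // Root orders the particles by destination rank, then scatters them back.
        std::vector<particle_t> routed;
        std::vector<int> send_counts(nprocs, 0), send_displs(nprocs, 0);
        if (rank == 0) {
            std::vector<std::vector<particle_t>> buckets(nprocs);
            for (particle_t& p : all) buckets[owner_of(p)].push_back(p);
            for (int r = 0; r < nprocs; r++) {
                send_counts[r] = (int)buckets[r].size();
                send_displs[r] = (int)routed.size();
                routed.insert(routed.end(), buckets[r].begin(), buckets[r].end());
            }
        }
        int in_n = 0;
        MPI_Scatter(send_counts.data(), 1, MPI_INT, &in_n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        inbound.resize(in_n);
        MPI_Scatterv(routed.data(), send_counts.data(), send_displs.data(), PARTICLE,
                     inbound.data(), in_n, PARTICLE, 0, MPI_COMM_WORLD);
    }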

After implementing the MPI distributed-memory algorithm, we achieved the simulation times seen in the output below. Graphs were plotted of simulation time against number of processes for strong and weak scaling.

 

 

Fig 13: Output obtained from running auto-bridges-mpi16

 

 

Fig 14: Plot of distributed algorithm (MPI) strong scaling simulation time against number of processes

 

Fig 15: Plot of distributed algorithm (MPI) weak scaling simulation time against number of processes

 

Conclusion

  After testing various MPI implementation methods and schemes, we were able to achieve a strong scaling efficiency of 0.23 and a weak scaling efficiency of 0.23. As in the serial case, considering only neighbouring particles keeps the overall complexity of the simulation at O(n).

The shared-memory algorithm, by contrast, performed slower than the serial one. It is therefore important to avoid excessive synchronization between threads while still ensuring that all threads are on the same step of the algorithm. Threads compete for particular sets of bins, so locks must be implemented carefully and acquiring multiple locks at once should be avoided; a sketch of one such scheme follows. Future work will explore further speedup methods, such as using a GPU to accelerate the computation.
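To make the locking point concrete, a hypothetical per-bin locking scheme is sketched below (not the graded code): because each thread holds at most one lock at a time, deadlock is impossible by construction.

    #include <omp.h>
    #include <vector>

    std::vector<omp_lock_t> bin_locks;   // one lock per bin

    void init_bin_locks(int n_bins) {
        bin_locks.resize(n_bins);
        for (omp_lock_t& l : bin_locks) omp_init_lock(&l);
    }

    // Rebin a moved particle: take only the destination bin's lock, never two.
    void move_particle_to_bin(particle_t* p, int dest_bin) {
        omp_set_lock(&bin_locks[dest_bin]);
        bins[dest_bin].push_back(p);     // bins from the serial binning sketch
        omp_unset_lock(&bin_locks[dest_bin]);
    }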

 

Source Code:

The source code can be found in the following archive: particles.tgz

 

 
