MATH 707: Applications of Parallel Computing

Homework 3: Parallelizing a Particle Simulation using GPU

 

Summary

The task is to parallelize a toy particle simulator that reproduces the behaviour shown in the following animation. At each time step, each particle's position is tested for proximity to other particles (within a fixed cutoff distance), and the particles are moved accordingly to avoid collisions.

     The provided source code performs the simulation using both a serial approach and CUDA on the GPU. This report describes the steps taken to optimize the naive GPU method.

 

Animation of particle interactions

Fig 1: Simulation of particles

 

GPU optimization

The CUDA model employed makes combined use of both the CPU and the GPU. Code that runs on the host refers to the CPU and its memory, while code that runs on the device refers to the GPU and its memory. The host runs the program, manages memory on both the host and the device, and launches the kernels that run in parallel on the device GPU. The program follows this sequence of operations:

  1. Declare and allocate memory for the CPU and GPU
  2. Declare and initialize the host variables
  3. Transmit data from the host to the device
  4. Use the transmitted data to run functions aka kernels
  5. Transmit the results back to the host
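The five steps above can be sketched in host code as follows. This is an illustrative outline only; the names `particle_t`, `move_kernel`, and the constants are assumptions, not the assignment's actual API.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

struct particle_t { double x, y, vx, vy, ax, ay; };

// A trivial kernel: each thread advances one particle (simple Euler step).
__global__ void move_kernel(particle_t *parts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;            // guard threads beyond the last particle
    parts[i].vx += parts[i].ax;
    parts[i].vy += parts[i].ay;
    parts[i].x  += parts[i].vx;
    parts[i].y  += parts[i].vy;
}

int main() {
    const int n = 1000;

    // Steps 1-2: declare/allocate and initialize host data
    particle_t *h_parts = (particle_t *) malloc(n * sizeof(particle_t));
    // ... initialize h_parts ...

    // Step 1 (device side): allocate device memory
    particle_t *d_parts;
    cudaMalloc(&d_parts, n * sizeof(particle_t));

    // Step 3: transmit data from host to device
    cudaMemcpy(d_parts, h_parts, n * sizeof(particle_t),
               cudaMemcpyHostToDevice);

    // Step 4: launch the kernel on the transmitted data
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    move_kernel<<<blocks, threads>>>(d_parts, n);

    // Step 5: transmit the results back to the host
    cudaMemcpy(h_parts, d_parts, n * sizeof(particle_t),
               cudaMemcpyDeviceToHost);

    cudaFree(d_parts);
    free(h_parts);
    return 0;
}
```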

     In the original serial implementation, each particle is individually compared against every other particle to calculate the shortest distance, and forces are applied to move the particles at each time step, giving O(n^2) work per step. We can reduce the overall complexity of the simulation to O(n) if we only consider the neighbouring particles.

     This was achieved by separating the particles into bins based on their coordinate locations in the grid, as seen in Fig 2. Thus, when each particle is analyzed, only the particles present in its local neighbourhood of bins (up, down, left and right) are investigated for a probable collision.
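The bin assignment itself is a short calculation: divide each coordinate by the bin width (chosen at least as large as the cutoff) and flatten the (row, column) pair into a single index. The names below (`bin_size`, `bins_per_row`) are illustrative assumptions, not identifiers from the assignment code.

```cuda
// Map a particle's (x, y) position to a flattened bin index.
// bin_size should be >= the interaction cutoff so that any particle
// within the cutoff lies in the same bin or an adjacent one.
__host__ __device__ inline int bin_of(double x, double y,
                                      double bin_size, int bins_per_row) {
    int bx = (int)(x / bin_size);   // bin column
    int by = (int)(y / bin_size);   // bin row
    return by * bins_per_row + bx;  // flattened index
}
```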

 

Fig 2: Representation of binning in local neighbourhood around selected particle

 

     However, in the CUDA implementation this binned approach, implemented naively, would be very expensive on the GPU. Therefore, the particles are stored in a flat array in a column-major format. The rest of the program is the same as the optimized version found here. Simulation times are measured in seconds and compared for the GPU and serial programs, as represented in Fig 3. Fig 4 shows a graphical comparison of the GPU and serial programs, and Fig 5 shows GPU simulation time for an increasing number of particles.
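One common way to build such a flat, GPU-friendly binned layout (a sketch of the general technique, not necessarily the exact scheme used in this code) is to histogram particles into bins with atomic increments, prefix-sum the counts, and then scatter particle indices into contiguous per-bin segments. The kernel below shows the counting step; `particle_t`, `bin_counts`, and the parameter names are assumptions for illustration.

```cuda
struct particle_t { double x, y, vx, vy, ax, ay; };

// Pass 1: count how many particles fall into each bin.
// bin_counts must be zeroed before launch (e.g. with cudaMemset).
__global__ void count_bins(const particle_t *parts, int *bin_counts,
                           int n, double bin_size, int bins_per_row) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int bx = (int)(parts[i].x / bin_size);
    int by = (int)(parts[i].y / bin_size);
    atomicAdd(&bin_counts[by * bins_per_row + bx], 1);  // bin histogram
}

// Pass 2 (not shown): run an exclusive prefix sum over bin_counts
// (e.g. thrust::exclusive_scan) to get each bin's starting offset in a
// flat particle-index array, then scatter each particle index into its
// bin's segment. The force kernel can then read each neighbouring bin's
// particles as one contiguous slice of the flat array.
```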

 

Fig 3: Output obtained on running auto-bridges-serial

 

Fig 4: Plot between simulation time obtained using Serial and GPU approach

 

 

Fig 5: Plot between number of particles and serial algorithm simulation time

 

Table 1: Simulation time comparison between GPU and Serial methods for varying number of particles

 

Table 2: Simulation time using GPU for increasing number of particles

Conclusion

We can immediately see that the CUDA implementation on the GPU performs significantly better than its serial counterpart. The difference is especially pronounced once the number of particles exceeds 1000. Running only the GPU implementation with 5 x 10^7 particles, the simulation took 39.96 seconds. CUDA lets us execute this application with a large number of threads at very high speed.

 

Source Code:

Source code can be found in the following archive, particles_gpu.tgz

 

 
