This repository contains three different approaches to parallelising a provided simple solver for the Navier-Stokes equations, which is used to model laminar flows of viscous, incompressible fluids. The three technologies used are OpenMP, CUDA, and MPI.
gprof was used initially to identify the most time-consuming section of the code. This was the poisson function, which accounted for approximately 96.53% of the program's runtime.

A Python script, validator.py, was used to validate the output of each port by comparing its VTK files to the original's. The script categorised each value in the files as exact, close (±0.02), or wrong.
Port | Wrong | Close (±0.02) | Exact | Cosine Similarity | Valid? |
---|---|---|---|---|---|
OpenMP | 0 | 0 | 267,302 | 100 | ✅ |
CUDA | 0 | 0 | 267,302 | 100 | ✅ |
MPI | 0 | 0 | 267,302 | 100 | ✅ |
Example output of the validation script:

```
Comparing implementation (original.vtk) to parallel implementation (openmp.vtk):
WRONG: 0/267302 → 0.0000%
CLOSE: 0/267302 → 0.0000%
EXACT: 267302/267302 → 100.0000%
Note: Close values are determined using a tolerance value of 0.02. Percentages are calculated to 4 decimal places.
Cosine Similarity: 100.0
PASS: Both files are an exact match → successful parallel implementation.
```
Loops in various locations were parallelised with the following pragmas, which share loop iterations between threads:

```c
#pragma omp parallel for collapse(2) reduction(+:p0)
#pragma omp parallel for collapse(2) reduction(+:res)
#pragma omp parallel for collapse(2) reduction(max:umax)
#pragma omp parallel for collapse(2) reduction(max:vmax)
```


To ensure consistent conditions, all ports were evaluated on Viking, the University of York's supercomputer. The main evaluations measured the total time taken by the main loop, using a code timer, across 20 problem sizes. Each size was tested multiple times and the results averaged to reduce the effect of outliers. CUDA experiments were run with and without checkpoints to assess their overhead, and an additional OpenMP experiment evaluated the effect of thread count. All profiling was performed on Viking.



Unfortunately, while the MPI port preserved the validity of the solution, the approach mentioned previously could not be completed successfully, hence the lack of a significant speedup.


Port | Average Time (s) | Speedup |
---|---|---|
Original | 135.26 | - |
OpenMP | 13.96 | 9.7× |
CUDA | 22.84 | 5.95× |
MPI | 131.81 | 1.02× |
All code was submitted as part of a master's module at the University of York: High-Performance Parallel and Distributed Systems.