πŸ“ˆ Demonstrations of different techniques to enhance the performance of computationally intensive scientific simulations in C.

high-performance-computing

This repository contains three different approaches to parallelising a provided simple solver for the Navier-Stokes equations, which is used to simulate laminar flows of viscous, incompressible fluids.

The three technologies used are OpenMP, CUDA, and MPI.

Initial Investigation

gprof was initially used to identify the most time-consuming section of the code: the poisson function, which accounted for approximately 96.53% of the program's runtime.

[Figure: gprof profiling output for the original solver]
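
For reference, the usual gprof workflow looks like the following (illustrative commands; the file names are placeholders, not the repository's exact build invocation):

gcc -pg -O2 -o solver solver.c -lm
./solver                     # writes gmon.out on exit
gprof solver gmon.out > profile.txt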

Validation

A Python script, validator.py, was used to validate the output of each port by comparing its VTK files to those produced by the original implementation. The script categorized each value in the files as exact, close (±0.02), or wrong, and also computed the cosine similarity between the two sets of values.
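
As a rough illustration of the comparison logic (a minimal Python sketch, not the repository's validator.py; the VTK parsing step is omitted and all names are illustrative):

import numpy as np

TOLERANCE = 0.02  # threshold separating "close" from "wrong"

def categorize(original, ported):
    """Classify each value pair and compute a 0-100 cosine similarity."""
    diff = np.abs(original - ported)
    exact = int(np.count_nonzero(diff == 0.0))
    close = int(np.count_nonzero((diff > 0.0) & (diff <= TOLERANCE)))
    wrong = int(np.count_nonzero(diff > TOLERANCE))

    # Cosine similarity between the two flattened value vectors.
    cosine = 100.0 * float(np.dot(original, ported)) / (
        np.linalg.norm(original) * np.linalg.norm(ported))
    return exact, close, wrong, cosine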

Results

| Port   | Wrong | Close (±0.02) | Exact   | Cosine Similarity | Valid? |
|--------|-------|---------------|---------|-------------------|--------|
| OpenMP | 0     | 0             | 267,302 | 100               | ✅     |
| CUDA   | 0     | 0             | 267,302 | 100               | ✅     |
| MPI    | 0     | 0             | 267,302 | 100               | ✅     |

Example output of validation script:

Comparing implementation (original.vtk) to parallel implementation (openmp.vtk):

WRONG: 0/267302 – 0.0000%
CLOSE: 0/267302 – 0.0000%
EXACT: 267302/267302 – 100.0000%

Note: Close values are determined using a tolerance value of 0.02. Percentages are calculated to 4 decimal places.

Cosine Similarity: 100.0
PASS: Both files are an exact match – successful parallel implementation.

Ports

OpenMP Approach

Pragmas such as the following were added at various locations to share loop iterations between threads (an illustrative application is sketched after the list):

#pragma omp parallel for collapse(2) reduction(+:p0)
#pragma omp parallel for collapse(2) reduction(+:res)
#pragma omp parallel for collapse(2) reduction(max:umax)
#pragma omp parallel for collapse(2) reduction(max:vmax)
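
As an illustration of how one of these reductions applies to a stencil loop nest (a minimal sketch; N, p, rhs, and residual are illustrative names, not the solver's actual identifiers):

/* Compile with -fopenmp. Sums the squared residual of a 5-point
   Poisson stencil: collapse(2) distributes the full i-j iteration
   space across threads, and reduction(+:res) safely combines the
   per-thread partial sums. */
#define N 128

double residual(double p[N][N], double rhs[N][N], double dx, double dy)
{
    double res = 0.0;
    #pragma omp parallel for collapse(2) reduction(+:res)
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            double r = (p[i+1][j] - 2.0 * p[i][j] + p[i-1][j]) / (dx * dx)
                     + (p[i][j+1] - 2.0 * p[i][j] + p[i][j-1]) / (dy * dy)
                     - rhs[i][j];
            res += r * r;
        }
    }
    return res;
}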

CUDA Approach

[Figure: overview of the CUDA approach]
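
As a general sketch of the technique (not the repository's exact kernel; the names and the Jacobi-style update are illustrative), the pressure update inside the poisson hotspot maps naturally onto a 2D grid of CUDA threads, one per interior cell:

/* Each thread updates one interior cell of the pressure field.
   The grid is stored row-major with a one-cell halo, so row i
   starts at i * (jmax + 2). dx2 and dy2 are dx*dx and dy*dy. */
__global__ void poisson_step(const double *p, double *p_new,
                             const double *rhs, int imax, int jmax,
                             double dx2, double dy2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i > imax || j > jmax) return;

    int stride = jmax + 2;
    int idx = i * stride + j;
    double beta = 0.5 / (1.0 / dx2 + 1.0 / dy2);
    p_new[idx] = beta * ((p[idx - stride] + p[idx + stride]) / dx2
                       + (p[idx - 1] + p[idx + 1]) / dy2
                       - rhs[idx]);
}

A launch such as poisson_step<<<dim3((imax + 15) / 16, (jmax + 15) / 16), dim3(16, 16)>>>(...) would cover the whole grid, with the p/p_new buffers swapped each iteration until the residual converges.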

MPI Approach

[Figure: overview of the MPI approach]
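
As a general sketch of the technique (not the repository's exact code; the names and the 1-D row-wise decomposition are illustrative), each rank could own a horizontal strip of the grid and exchange single-row halos with its neighbours between iterations:

#include <mpi.h>

/* Rows 1..local_rows are interior; rows 0 and local_rows + 1 are halo
   rows. Passing MPI_PROC_NULL as a neighbour makes the corresponding
   transfer a no-op, which handles the edges of the global domain. */
void exchange_halos(double *p, int local_rows, int cols,
                    int up, int down, MPI_Comm comm)
{
    MPI_Sendrecv(&p[1 * cols], cols, MPI_DOUBLE, up, 0,
                 &p[(local_rows + 1) * cols], cols, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&p[local_rows * cols], cols, MPI_DOUBLE, down, 1,
                 &p[0], cols, MPI_DOUBLE, up, 1,
                 comm, MPI_STATUS_IGNORE);
}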

Benchmarks / Speedup

To ensure consistent conditions, all ports were evaluated on Viking, the University of York's supercomputer. The main evaluations measured the total time for the main loop to complete, using a code timer, across 20 problem sizes. Each size was tested multiple times and the results were averaged to reduce the effect of outliers. CUDA experiments were conducted both with and without checkpoints to assess their overhead, and an additional OpenMP experiment evaluated the effect of thread count.
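
The timer itself can be as simple as a monotonic wall-clock measurement around the main loop (a minimal sketch, not the repository's exact timing code):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* ... main simulation loop ... */

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Main loop: %.3f s\n", elapsed);
    return 0;
}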

Original Analysis

[Figure: benchmark results for the original implementation]

OpenMP Analysis

[Figure: benchmark results for the OpenMP port]

CUDA Analysis

[Figure: benchmark results for the CUDA port]

MPI Analysis

Unfortunately, while the MPI port preserved the validity of the solution, the approach described above could not be completed successfully, hence the lack of a significant speedup.

[Figure: benchmark results for the MPI port]

Comparative Analysis

[Figure: comparative benchmark results across all ports]

| Port     | Average Time | Speedup |
|----------|--------------|---------|
| Original | 135.26       | -       |
| OpenMP   | 13.96        | x9.7    |
| CUDA     | 22.84        | x5.95   |
| MPI      | 131.81       | x1.02   |

All code was submitted as part of a master's module at the University of York: High-Performance Parallel and Distributed Systems.
