In this assignment, you will develop a distributed version of the preconditioned conjugate gradient method for solving linear systems. This website has some details and explanations: https://phtournier.pages.math.cnrs.fr/5mm29/blockjacobi/
Please form 3-4 person groups by Sunday, April 20, 11:59 PM. If you have not found a group by then, we will assign you to one randomly. Please use the following project group form: https://forms.gle/rAmpq35qz93ueABP8
The goal of this assignment is to use the preconditioned conjugate gradient (PCG) method to solve the 1D Poisson equation, discretized as the NxN linear system Lx = b (not (L + I)x = b; we have updated the GitHub repository), where L has 2s on its diagonal and -1s on its super- and sub-diagonals (see the starter code to understand the pattern). The vector b is all 1s, and the PCG method starts with an initial guess x of all 0s. The starter code only works on 1 rank, and it is not efficient. In this assignment, please use MPI to develop a distributed version of this PCG solver. For simplicity, we assume that N is a multiple of the number of ranks used. Specifically, please do the following:
Sparse Matrix Format Conversion: Convert the map-based sparse matrix representation in the starter code to a more efficient format (e.g., CSR (Compressed Sparse Row) or CSC (Compressed Sparse Column)).
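Because the matrix pattern is known (2 on the diagonal, -1 on the super-/sub-diagonals), one option is to assemble it directly in CSR rather than converting the map entry by entry. The following is a minimal sketch; the names `CSRMatrix` and `build_poisson_csr` are illustrative and do not appear in the starter code.

```cpp
#include <cassert>
#include <vector>

// Illustrative CSR container: row i's nonzeros occupy the half-open
// range [row_ptr[i], row_ptr[i+1]) in col_idx and vals.
struct CSRMatrix {
    int n;                     // number of rows
    std::vector<int> row_ptr;  // size n+1
    std::vector<int> col_idx;  // column index of each nonzero
    std::vector<double> vals;  // value of each nonzero
};

// Build the 1D Poisson matrix (tridiagonal: -1, 2, -1) directly in CSR.
CSRMatrix build_poisson_csr(int n) {
    CSRMatrix A;
    A.n = n;
    A.row_ptr.reserve(n + 1);
    A.row_ptr.push_back(0);
    for (int i = 0; i < n; ++i) {
        if (i > 0)     { A.col_idx.push_back(i - 1); A.vals.push_back(-1.0); }
        A.col_idx.push_back(i); A.vals.push_back(2.0);
        if (i < n - 1) { A.col_idx.push_back(i + 1); A.vals.push_back(-1.0); }
        A.row_ptr.push_back(static_cast<int>(A.col_idx.size()));
    }
    return A;
}
```

CSR makes the matvec a simple nested loop over `row_ptr`, with contiguous memory access, which is what makes it faster than a map-based representation.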
Layout Formulation: Decide how to distribute the sparse matrix across MPI processes.
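A natural choice for this matrix is a 1D block-row layout: since N is a multiple of the number of ranks P, each rank owns N/P contiguous rows of L and the matching slices of the vectors. A sketch of the index arithmetic (the helper name `local_range` is illustrative):

```cpp
#include <cassert>
#include <utility>

// 1D block-row partition. With N a multiple of P (as the assignment
// assumes), rank r owns global rows [r*N/P, (r+1)*N/P) as a half-open
// interval, and the same slice of x, b, r, p, etc.
std::pair<int, int> local_range(int N, int P, int rank) {
    int rows_per_rank = N / P;          // exact by assumption
    int first = rank * rows_per_rank;   // first global row owned by this rank
    return {first, first + rows_per_rank};
}
```

With this layout, the only off-rank data a tridiagonal matvec needs is a single boundary value from each neighboring rank.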
Parallelizing Operations in PCG: Parallelize the matrix-vector multiplication and the vector norm computations across MPI ranks.
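The two kernels can be written as per-rank local computations plus communication. A sketch, with the communication left as comments: in the MPI version the `left_ghost`/`right_ghost` values would be exchanged with neighbor ranks (e.g., via `MPI_Sendrecv`), and the partial sums from the local dot product would be combined with `MPI_Allreduce(..., MPI_SUM, ...)`. The function names are illustrative, not from the starter code.

```cpp
#include <cassert>
#include <vector>

// Local tridiagonal matvec y = L*x for one rank's block of contiguous
// rows: each row computes -x[i-1] + 2*x[i] - x[i+1]. The first and
// last local rows need one "ghost" (halo) value from the neighboring
// rank; ranks at the global boundary pass 0.0, matching the matrix.
std::vector<double> local_matvec(const std::vector<double>& x,
                                 double left_ghost, double right_ghost) {
    int n = static_cast<int>(x.size());
    std::vector<double> y(n);
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? x[i - 1] : left_ghost;
        double right = (i < n - 1) ? x[i + 1] : right_ghost;
        y[i] = -left + 2.0 * x[i] - right;
    }
    return y;
}

// Each rank computes the dot product over its own slice only; the
// global value is the MPI_SUM reduction of these partial results
// (the norm is the square root of the reduced self dot product).
double local_dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
```

Note that each `MPI_Allreduce` is a synchronization point, which is why reducing the number of separate reductions matters in the optimization step below.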
Optimize the PCG framework: Is there any opportunity to reuse intermediate values or reorder operations to make the solver faster? Do the methods you use affect the residual error, and why?
Scaling: Evaluate strong and weak scaling of your distributed PCG solver by conducting tests with varying numbers of ranks to understand the performance characteristics of your implementation. Evaluate speedups on 1-64 ranks on a single node and across different NxN matrices (N = 2^20 to 2^26).
Also, plot the relationship between the number of PCG iterations and the number of ranks used, for a few N. Explain the behavior based on your understanding of the algorithm and the code.
You should also plot the relationship between the residual error and the number of ranks used, for a few N. Explain the behavior based on your understanding of the algorithm and the code.
Moreover, if you adopt the preconditioner used for 64 ranks, but run it sequentially, you might get a slightly different residual error. Explain why you might (or might not) observe this behavior based on your own distributed implementation.
Your submission should include distributed_pcg.cpp and the report pdf. The report must detail your implementation strategy, performance comparisons, scalability results, and any insights or challenges you encountered throughout the assignment. Ensure your code is well-documented and follows the assignment specifications closely.
The starter code is available on GitHub at https://github.com/Berkeley-CS267/HW4
Note that this starter code on GitHub only runs on 1 node and 1 task.
The README on GitHub has instructions on how to build your code, etc.