PARALLEL IMPLEMENTATION OF CONJUGATE GRADIENT SPARSE LINEAR SYSTEM SOLVER LIBRARY THAT SCALES VERY WELL
Parallel implementation of Conjugate Gradient Linear Sparse System Solver library that scales very well version 1.79
Author: Amine Moulay Ramdane
Description:
I have come up with a new algorithm of my Parallel Conjugate gradient sparse solver library, now it has become cache-aware, but you have to notice that this new cache-aware algorithm is more efficient on multicores, since i have benchmarked it against my previous algorithm and it has given a scalability of 5X on a Quadcore over the single thread of my previous algorithm , that's a really a big improvement !.
This Parallel library is especially designed for large scale industrial engineering problems that you find on industrial Finite element problems and such, this scalable Parallel library was ported to FreePascal and all the Delphi XE versions and even to Delphi 7, hope you will find it really good.
Sparse linear system solvers are ubiquitous in high performance computing (HPC) and often are the most computational intensive parts in scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate modeling, and environmental modeling, and finite element method, and large-scale reservoir simulations to enhance oil recovery by the oil and gas industry.
Conjugate Gradient is known to converge to the exact solution in n steps for a matrix of size n, and was historically first seen as a direct method because of this. However, after a while people figured out that it works really well if you just stop the iteration much earlier - often you will get a very good approximation after much fewer than n steps. In fact, we can analyze how fast Conjugate gradient converges. The end result is that Conjugate gradient is used as an iterative method for large linear systems today.
Read here:
https://en.wikipedia.org/wiki/Sparse_matrix
The conjugate gradient method is unstable with respect to even small perturbations, e.g., most directions are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative method as it provides monotonically improving approximations to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number κ(A) of the system matrix A: the larger is κ(A), the slower the improvement.
Read more here: http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf
As you have noticed it says:
"When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use specialized algorithms and data structures that take advantage of the sparse structure of the matrix. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices as processing and memory are wasted on the zeroes. Sparse data is by nature more easily compressed and thus require significantly less storage. Some very large sparse matrices are infeasible to manipulate using standard dense-matrix algorithms."
I have taken care of that on my new algorithm, i have used my ParallelHashList datastructure to store the sparse matrices of the linear systems so that it become very fast and so that it doesn't waste on the zeros, in fact my new algorithm doesn't store the zeros of the sparse matrix of the linear system.
I have also implemented another scalable parallel algorithm that is cache-aware and NUMA-aware and that is scalable on NUMA architecture, and that is designed for dense matrices that you find on Linear Equations arising from Integral Equation Formulations, this one stores the zeros of the sparse matrix of the linear system , here it is:
The Parallel implementation of Conjugate Gradient Sparse Linear System Solver that i programmed here is designed to be used to solve large sparse systems of linear equations where the direct methods can exceed available machine memory and/or be extremely time-consuming. for example the direct method of the Gauss algorithm takes O(n^2) in the back substitution process and is dominated by the O(n^3) forward elimination process, that means, if for example an operation takes 10^-9 second and we have 1000 equations , the elimination process in the Gauss algorithm will takes 0.7 second, but if we have 10000 equations in the system , the elimination process in the Gauss algorithm will take 11 minutes !. This is why i have develloped for you the Parallel implementation of Conjugate Gradient Sparse Linear System Solver in Object Pascal, that is very fast.
You have only one method to use that is Solve()
function TParallelConjugateGradient.Solve(var A: arrarrext;var B,X:VECT;var RSQ:DOUBLE;nbr_iter:integer;show_iter:boolean): boolean;
The system: A*x = b
The important parameters in the Solve() method are:
A is the matrix , B is the b vector, X the initial vector x,
nbr_iter is the number of iterations that you want and show_iter to show the number of iteration on the screen.
RSQ is the sum of the squares of the components of the residual vector A.x - b.
The Solve() method is thread-safe, so you can call it from multiple threads, everything else is thread-safe except for the constructor , you have to call the constructor one time from a process and use the object from multiple threads.
I have gotten over 5X scalability on a quad core.
The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep, geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Conjugate gradient is the most popular iterative method for solving large systems of linear equations. CG is effective for systems of the form A.x = b where x is an unknown vector, b is a known vector, A is a known square, symmetric, positive-definite (or positive-indefinite) matrix. These systems arise in many important settings, such as finite difference and finite element methods for solving partial differential equations, structural analysis, circuit analysis, and math homework
The Conjugate gradient method can also be applied to non-linear problems, but with much less success since the non-linear functions have multiple minimums. The Conjugate gradient method will indeed find a minimum of such a nonlinear function, but it is in no way guaranteed to be a global minimum, or the minimum that is desired. But the conjugate gradient method is great iterative method for solving large, sparse linear systems with a symmetric, positive, definite matrix.
In the method of conjugate gradients the residuals are not used as search directions, as in the steepest decent method, cause searching can require a large number of iterations as the residuals zig zag towards the minimum value for ill-conditioned matrices. But instead conjugate gradient method uses the residuals as a basis to form conjugate search directions . In this manner, the conjugated gradients (residuals) form a basis of search directions to minimize the quadratic function f(x)=1/2*Transpose(x)*A*x + Transpose(b)*x and to achieve faster speed and result of dim(N) convergence.
You can go to download the zip files from the following web link:
https://drive.google.com/drive/folders/1PzkB8UnMckfbyY_XMyuhG4Xh5E8oZOB3?usp=sharing
Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/
Operating Systems: Windows, Mac OSX , Linux...
Required FPC switches: -O3 -Sd
-Sd for delphi mode....
Required Delphi switches: -$H+ -DDelphi
{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems
{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems
And please comment the MREW define's option and uncomment the DRWLOCK define's option inside defines.inc or defines2.inc to use the scalable distributed reader-writer mutex that scales better, and when you set it for DRWLOCK , please set the number of rwlocks parameter of the TSparseMatrix constructor to around 2000 for example, this will use less memory and this will be scalable.