Post date: Mar 1, 2010 11:12:13 PM
This was the topic of my master's thesis for the M.Sc. in Parallel and Distributed Computing at the Polytechnic University of Valencia. It proposes a way to use GPUs to speed up the computation of the eigenvalues of a special class of matrices: symmetric Toeplitz matrices. These appear frequently in signal processing problems, among other fields.
Real Toeplitz matrices, and in particular symmetric ones, have interesting structural properties that can be exploited. One of them is that Toeplitz systems can be solved in O(n^2) operations instead of the classical O(n^3). There are different ways of achieving this, which will be discussed later.
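As a reminder of the structure involved: a symmetric Toeplitz matrix is fully determined by its first column, so it only needs O(n) storage. A minimal sketch (my own illustration, not the thesis code):

```python
def sym_toeplitz(t):
    """Build the full n x n symmetric Toeplitz matrix from its first column t.

    Entry (i, j) depends only on |i - j|, so the whole matrix is
    determined by the n values in t.
    """
    n = len(t)
    return [[t[abs(i - j)] for j in range(n)] for i in range(n)]

T = sym_toeplitz([4.0, 1.0, 0.5])
# T == [[4.0, 1.0, 0.5],
#       [1.0, 4.0, 1.0],
#       [0.5, 1.0, 4.0]]
```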
Another interesting property is that its Gershgorin discs can be computed very easily in linear time, yielding a real interval that contains all the eigenvalues of the matrix. With some extra calculations, this interval can easily be divided into subintervals such that the number of eigenvalues contained in each of them is known and roughly equal, avoiding unbalanced cases. After that, the shift-and-invert 2-way Lanczos method can be used to extract the eigenvalues of each subinterval independently.
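For a real symmetric matrix the Gershgorin discs collapse to real intervals, and for a symmetric Toeplitz matrix the largest disc radius can be found from the first column alone in O(n) with prefix sums. A hedged sketch of the idea (my own illustration, not the thesis code):

```python
def gershgorin_interval(t):
    """Return (lo, hi) containing all eigenvalues of the symmetric
    Toeplitz matrix with first column t, computed in O(n).

    Row i has off-diagonal magnitudes |t_1|..|t_i| to the left and
    |t_1|..|t_{n-1-i}| to the right, so its Gershgorin radius is
    prefix[i] + prefix[n - 1 - i], with prefix[k] = |t_1| + ... + |t_k|.
    """
    n = len(t)
    prefix = [0.0] * n
    for k in range(1, n):
        prefix[k] = prefix[k - 1] + abs(t[k])
    radius = max(prefix[i] + prefix[n - 1 - i] for i in range(n))
    return t[0] - radius, t[0] + radius

lo, hi = gershgorin_interval([4.0, 1.0, 0.5])
# middle row has radius |1| + |1| = 2, so the interval is (2.0, 6.0)
```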
The shift-and-invert 2-way Lanczos method is a variation of the Lanczos algorithm for eigenvalue computation in symmetric matrices. Since the original method converges first to the extremal eigenvalues, a shift-and-invert technique is used to redirect this convergence towards the eigenvalue nearest to a given shift σ. This σ value is initially set to the center of the subinterval and later chosen randomly inside its boundaries, as numerical instabilities may appear if the chosen σ is very close to an actual eigenvalue.
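The effect of the shift-and-invert transform is easy to check numerically: the eigenvalues of (A - σI)^-1 are 1/(λ_i - σ), so the eigenvalue of A closest to σ becomes the dominant one, which is exactly what Lanczos converges to first. A quick numpy check (my own illustration; the matrix and σ are arbitrary):

```python
import numpy as np

# A small symmetric Toeplitz matrix (arbitrary example values).
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 4.0, 1.0],
              [0.5, 1.0, 4.0]])
sigma = 3.4  # shift placed inside the subinterval of interest

lam = np.linalg.eigvalsh(A)                                    # eigenvalues of A
mu = np.linalg.eigvalsh(np.linalg.inv(A - sigma * np.eye(3)))  # shifted/inverted

# The eigenvalue of A nearest to sigma maps to the mu of largest magnitude.
nearest = lam[np.argmin(np.abs(lam - sigma))]
dominant = 1.0 / mu[np.argmax(np.abs(mu))] + sigma
assert np.isclose(nearest, dominant)
```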
The second modification, the 2-way part, concerns how the Lanczos method builds the Krylov subspace. Again taking advantage of the symmetric Toeplitz structure, it is possible to build two Krylov subspaces at once in each iteration, thanks to special symmetry properties of the system solution vector. This effectively allows eigenvalues to be extracted at twice the rate of the original Lanczos method.
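The symmetry the 2-way variant exploits can be seen directly: a symmetric Toeplitz matrix T is persymmetric (JTJ = T, with J the flip matrix), so the solution of Tx = b is symmetric under the flip whenever b is, and antisymmetric whenever b is. Splitting the right-hand side into its symmetric and antisymmetric parts is what lets each solve feed two subspaces at once. A numpy illustration (my own, not the thesis code):

```python
import numpy as np

t = np.array([4.0, 1.0, 0.5, 0.25])
n = len(t)
T = np.array([[t[abs(i - j)] for j in range(n)] for i in range(n)])
J = np.eye(n)[::-1]                    # flip (exchange) matrix

assert np.allclose(J @ T @ J, T)       # persymmetry: J T J = T

b = np.array([1.0, 2.0, 3.0, 4.0])
b_sym = 0.5 * (b + J @ b)              # symmetric part:     J b_sym  ==  b_sym
b_anti = 0.5 * (b - J @ b)             # antisymmetric part: J b_anti == -b_anti

x_sym = np.linalg.solve(T, b_sym)
x_anti = np.linalg.solve(T, b_anti)

# The solutions inherit the symmetry class of their right-hand sides.
assert np.allclose(J @ x_sym, x_sym)
assert np.allclose(J @ x_anti, -x_anti)
```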
The key idea behind the parallelization is the complete independence between the different subintervals. Since the target was to implement these concepts on GPUs, a different CUDA thread block was assigned to each subinterval. Each block runs the shift-and-invert 2-way Lanczos method independently until all the eigenvalues of its subinterval have been found within a given error threshold.
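The mapping can be sketched on the CPU with independent workers standing in for CUDA thread blocks (a loose analogy, my own illustration; `eigs_in` is a stub that filters a precomputed dense spectrum, not the 2-way Lanczos solver):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

t = np.array([4.0, 1.0, 0.5, 0.25, 0.125])
n = len(t)
T = np.array([[t[abs(i - j)] for j in range(n)] for i in range(n)])

lam = np.linalg.eigvalsh(T)   # stand-in "solver": dense spectrum up front

def eigs_in(interval):
    """Stub for the per-subinterval eigensolver (not the thesis code)."""
    lo, hi = interval
    return [x for x in lam if lo <= x < hi]

# Partition an interval covering the whole spectrum into subintervals.
edges = np.linspace(lam.min() - 1.0, lam.max() + 1.0, 5)
subintervals = list(zip(edges[:-1], edges[1:]))

# Each worker stands in for one CUDA thread block; no worker ever needs
# data from another, which is the property the GPU mapping relies on.
with ThreadPoolExecutor() as pool:
    found = [e for part in pool.map(eigs_in, subintervals) for e in part]

assert sorted(found) == sorted(lam.tolist())
```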
Another key point mentioned before is the system solving step, one of the most important and usually most expensive steps of the Lanczos algorithm. The performance of the eigenvalue computation method relies on solving symmetric Toeplitz systems efficiently. One way to solve them in O(n^2) is to transform the symmetric Toeplitz matrix into a Cauchy-like matrix that can be decomposed into two blocks of half the original size. This can be done by means of the normalized Discrete Sine Transform (similar to an FFT) and some permutations. Once the matrix is decomposed into two independent blocks, an adapted LDL' decomposition is applied once to each of them. Given these decompositions, solving for one vector reduces to solving a pair of triangular systems per block, which allows many different right-hand sides to be solved for the same matrix without recomputing the LDL' decomposition. This comes in very handy in the Lanczos algorithm, where the matrix is constant for each σ but a different right-hand-side vector must be solved in each iteration. However, a little processing of the input and output vectors is required, as the same normalized Discrete Sine Transform and permutations must be applied to them and back, but this small computational cost is completely worth it. The same LDL' decomposition approach is used to compute how many eigenvalues are smaller than a given σ during the subinterval partitioning.
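Two of the uses of LDL' mentioned above can be sketched generically (this is a plain dense LDL' without pivoting, my own illustration, not the Cauchy-like block version from the thesis): once T - σI = LDL' is factored, any number of right-hand sides can be solved with two triangular solves each, and by Sylvester's law of inertia the number of negative entries in D equals the number of eigenvalues smaller than σ.

```python
import numpy as np

def ldl(A):
    """Plain dense LDL' factorization of a symmetric matrix (no pivoting)."""
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        d[j] = A[j, j] - (L[j, :j] ** 2) @ d[:j]
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ d[:j]) / d[j]
    return L, d

def solve_ldl(L, d, b):
    y = np.linalg.solve(L, b)            # solve L y = b
    return np.linalg.solve(L.T, y / d)   # scale by D, solve L' x = y / d

t = np.array([4.0, 1.0, 0.5])
n = len(t)
T = np.array([[t[abs(i - j)] for j in range(n)] for i in range(n)])
sigma = 3.4

L, d = ldl(T - sigma * np.eye(n))
# Factor once, then solve many right-hand sides, as Lanczos requires:
for b in (np.ones(n), np.arange(1.0, n + 1)):
    x = solve_ldl(L, d, b)
    assert np.allclose((T - sigma * np.eye(n)) @ x, b)

# Sylvester's law of inertia: negative entries of d count eigenvalues < sigma.
count = int((d < 0).sum())
assert count == int((np.linalg.eigvalsh(T) < sigma).sum())
```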
Unfortunately, the only graphics card we could get before the thesis deadline was a GeForce 9800 GX2, with no double precision support and only 512 MB of memory per GPU (it's a dual-GPU card). This led to several problems. First of all, the approach just explained requires too much memory for this card, as the LDL' decomposition of the two Cauchy-like blocks raises memory requirements to roughly n^2/2 elements, since the full matrix blocks need to be stored for optimal memory access. This led to the use of Levinson's algorithm for solving the symmetric Toeplitz systems in O(n^2) time. The problem with this method is the number of reductions performed in each iteration (operations which project to lower dimensionality, like dot products or norms), which can introduce a performance penalty on GPUs. Also, this algorithm requires the input matrices to be symmetric positive-definite, constraining the input of the global problem.
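Levinson's algorithm itself fits in a few lines and only ever touches O(n) extra memory, which is what made it viable on the 512 MB card; the reductions mentioned above are the dot products inside the loop. A hedged pure-numpy sketch following the classical formulation (my own illustration, not the CUDA code from the thesis):

```python
import numpy as np

def levinson(t, b):
    """Solve T x = b for a symmetric positive-definite Toeplitz matrix
    with first column t, in O(n^2) time and O(n) extra memory.

    Classical Levinson recursion; each iteration performs dot-product
    reductions, which is the part that maps poorly to a GPU thread block.
    """
    r = np.asarray(t, dtype=float) / t[0]   # normalize so T[0, 0] == 1
    b = np.asarray(b, dtype=float) / t[0]
    n = len(r)
    x = np.array([b[0]])
    y = np.array([-r[1]])                   # auxiliary Yule-Walker solution
    beta, alpha = 1.0, -r[1]
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha
        mu = (b[k] - r[1:k + 1][::-1] @ x) / beta          # reduction
        x = np.append(x + mu * y[::-1], mu)
        if k < n - 1:
            alpha = -(r[k + 1] + r[1:k + 1][::-1] @ y) / beta  # reduction
            y = np.append(y + alpha * y[::-1], alpha)
    return x

x = levinson([4.0, 1.0, 0.5], [1.0, 2.0, 3.0])
```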
The second problem was, of course, precision. Single precision was simply not enough to handle this problem properly, and only the eigenvalues of very small matrices could be computed reliably. Moreover, the frequent use of accumulator variables in the operations of Levinson's algorithm introduced even more error when the accumulated values had different orders of magnitude. Since GPUs provide real acceleration only for large problem sizes, not small ones, this rendered the implementation, as it stands, useless.
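The accumulator problem is easy to reproduce: once a single-precision accumulator is large, addends below its unit in the last place are lost entirely. A minimal illustration (my own, not from the thesis):

```python
import numpy as np

# ulp(1e8) in float32 is 8, so adding 1.0 repeatedly changes nothing:
acc32 = np.float32(1e8)
for _ in range(1000):
    acc32 += np.float32(1.0)

acc64 = np.float64(1e8)
for _ in range(1000):
    acc64 += 1.0

print(acc32 - np.float32(1e8))   # 0.0 -> a thousand additions vanished
print(acc64 - 1e8)               # 1000.0
```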
However, this work provided interesting insights that enabled further research. Not long after the thesis deadline we got a Quadro FX 5800 graphics card, and now even some Tesla cards. All these GPUs support double precision floating point operations and have 4 GB of GDDR3 memory. With the main previous problems overcome, we now have multi-GPU implementations that solve symmetric Toeplitz systems about 10 to 12 times faster than the CPU, for sizes up to approximately n=46000. All the details will be published soon, and with these new results the original eigenvalue problem can be tackled again.
Back to the original thesis subject: all the details can be found in the attached PDF, though I'm afraid it's in Spanish only. If you have any questions, don't hesitate to ask.