HPC: High Performance Computing / Scientific Computing
We collaborate closely with the 'High Performance Computing and Applications' group at the University of Almeria on the development and evaluation of High Performance Computing (HPC) techniques to accelerate computationally demanding problems in three-dimensional electron microscopy. We also collaborate on novel, fast approaches to major operations in scientific computing, such as the sparse matrix-vector product (SpMV). In these works, we devise solutions for execution on state-of-the-art HPC platforms (supercomputers, GPUs, standard multicore computers) and make use of different parallel paradigms and strategies (MPI, shared memory, GPU computing, vectorization, single-core code optimization, and hybrid computing techniques).
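As a flavour of the kind of operations and parallel strategies involved, the sketch below shows a sparse matrix-vector product in the common CSR format, parallelized across rows with OpenMP threads. It is only a minimal illustration of SpMV under shared memory; it is not taken from any of the codes developed in this collaboration, and all identifiers are hypothetical.

```cpp
// Minimal CSR sparse matrix-vector product y = A*x, parallelized with
// OpenMP across rows. Illustrative only; array names are hypothetical.
#include <vector>
#include <cstdio>

struct CsrMatrix {
    int n_rows;
    std::vector<int>    row_ptr;  // size n_rows + 1
    std::vector<int>    col_idx;  // column index of each nonzero
    std::vector<double> values;   // value of each nonzero
};

void spmv_csr(const CsrMatrix& A, const std::vector<double>& x,
              std::vector<double>& y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A.n_rows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[i] = sum;  // each row is independent, so no synchronization is needed
    }
}

int main() {
    // 3x3 example: [[4,1,0],[0,3,0],[2,0,5]]
    CsrMatrix A{3, {0, 2, 3, 5}, {0, 1, 1, 0, 2}, {4, 1, 3, 2, 5}};
    std::vector<double> x{1, 2, 3}, y(3);
    spmv_csr(A, x, y);
    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // expected: 6 6 17
}
```

The publications listed further below address the same operation with GPU-oriented storage formats such as ELLPACK-R and the ELLR-T kernel.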
People involved in this project (present and past):
JI Agulleiro
JR Bilbao-Castro
I Garcia
EM Garzon
JA Martinez
A Martinez-Sanchez
JJ Moreno
F Vazquez
Modern computing architectures.
(Top-left) Modern computers ship with several multicore chips (typically 2–4) configured to share a centralized memory. Each multicore chip contains several computing cores (2–6) sharing a cache memory (typically the third level, L3). Internally, each core contains two more cache levels (L1 and L2, not shown in this figure).
(Top-right) Cluster of multicore computers. Each node has m processors sharing a single centralized memory, and the nodes are connected through an interconnection network. Most current supercomputers are also based on this architectural model. In this case, the so-called distributed-shared memory (DSM) architecture may be available, whereby there is a virtually unique memory system but memory access is non-uniform (NUMA), depending on the physical location of the data.
(Bottom-left) Graphics Processing Units (GPUs) are composed of several Streaming Multiprocessors (SMs) (e.g. 30 and 16 in the second and third generation of NVIDIA GPUs, respectively). Each SM is made up of a number of cores (8 and 32, respectively) that share a register file and a local memory. All the SMs share the global device memory. In the third generation, a hierarchy of cache memory is provided; in particular, an L2 cache level sits between the SMs and the device memory.
(Bottom-right) Hybrid CPU+GPU computing on a computer equipped with multicore processors and multiple GPUs. The system keeps a pool of tasks to do. A number of threads mapped to CPU cores (denoted by C-threads) run concurrently in the system, together with specific threads (denoted by G-threads) in charge of the tasks to be computed on the GPUs. The tasks are asynchronously dispatched to the threads on demand. In the figure, the allocation of tasks to threads is color-coded. Note that the G-threads request tasks more often than the C-threads because GPUs carry out the calculations faster than a single CPU core; moreover, faster GPUs are assigned work more frequently than more modest ones.
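The on-demand dispatching described for the bottom-right panel can be sketched with a shared task counter from which both kinds of threads pull work. The following minimal C++ illustration replaces the actual CPU and GPU reconstruction routines with placeholders (process_slice_cpu and process_slice_gpu are hypothetical names); it is not code from the hybrid systems published below, and only shows why faster workers naturally end up taking more tasks.

```cpp
// Hybrid CPU+GPU on-demand dispatching: C-threads and G-threads pull task
// indices from a shared pool; faster workers naturally take more tasks.
// Illustrative sketch only; the GPU work is mocked with a placeholder.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> next_task{0};   // shared pool: tasks 0..n_tasks-1
const int n_tasks = 100;

void process_slice_cpu(int t) { /* reconstruct slice t on a CPU core (placeholder) */ }
void process_slice_gpu(int t) { /* reconstruct slice t on a GPU (placeholder) */ }

void worker(bool is_gpu) {
    // Each thread repeatedly requests the next unprocessed task until the
    // pool is exhausted; no static partitioning of the work is imposed.
    for (int t = next_task.fetch_add(1); t < n_tasks; t = next_task.fetch_add(1)) {
        if (is_gpu) process_slice_gpu(t);
        else        process_slice_cpu(t);
    }
}

int main() {
    std::vector<std::thread> pool;
    const int n_cpu_threads = 4;   // C-threads, one per spare CPU core
    const int n_gpu_threads = 2;   // G-threads, one per GPU in the system
    for (int i = 0; i < n_cpu_threads; ++i) pool.emplace_back(worker, false);
    for (int i = 0; i < n_gpu_threads; ++i) pool.emplace_back(worker, true);
    for (auto& th : pool) th.join();
    std::printf("all %d tasks dispatched\n", n_tasks);
}
```

Because tasks are claimed on demand rather than statically partitioned, the workload balances itself across heterogeneous devices.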
Relevant publications:
Reviews
Hybrid Computing (GPU+CPU)
Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction.
JI Agulleiro, F Vazquez, EM Garzon, JJ Fernandez.
Ultramicroscopy 115:109-114, 2012. [Software for developers]
Tomographic Reconstruction on standard computers (code optimization, vectorization, threads)
Tomo3D 2.0 – exploitation of advanced vector extensions (AVX) for 3D reconstruction.
JI Agulleiro, JJ Fernandez.
Journal of Structural Biology 189:147-152, 2015. [Software]
Evaluation of a multicore-optimized implementation for tomographic reconstruction.
JI Agulleiro, JJ Fernandez.
PLoS ONE 7(11): e48261, 2012. [PDF] [Software]
Fast tomographic reconstruction on multicore computers.
JI Agulleiro, JJ Fernandez.
Bioinformatics 27:582-583, 2011. [Software]
Vectorization with SIMD extensions speeds up reconstruction in electron tomography.
JI Agulleiro, EM Garzon, I Garcia, JJ Fernandez.
Journal of Structural Biology 170:570-575, 2010. [Software]
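To make the idea of SIMD vectorization concrete, the sketch below performs a generic weighted accumulation (out[i] += w*in[i]) eight single-precision values at a time with AVX intrinsics, a basic accumulation pattern of the kind found in voxel updates. It is an illustration of the technique only, assuming an AVX-capable CPU, and is not code from Tomo3D or the other packages listed above.

```cpp
// Generic SIMD accumulation: out[i] += w * in[i], eight floats per AVX op.
// Illustration of vectorization only; not taken from any released code.
#include <immintrin.h>
#include <cstdio>

void axpy_avx(float* out, const float* in, float w, int n) {
    __m256 vw = _mm256_set1_ps(w);               // broadcast the weight
    int i = 0;
    for (; i + 8 <= n; i += 8) {                  // vectorized body: 8 floats/iter
        __m256 vin  = _mm256_loadu_ps(in + i);
        __m256 vout = _mm256_loadu_ps(out + i);
        vout = _mm256_add_ps(vout, _mm256_mul_ps(vin, vw));
        _mm256_storeu_ps(out + i, vout);
    }
    for (; i < n; ++i) out[i] += w * in[i];       // scalar remainder
}

int main() {
    float in[10], out[10];
    for (int i = 0; i < 10; ++i) { in[i] = float(i); out[i] = 1.0f; }
    axpy_avx(out, in, 0.5f, 10);
    std::printf("%g %g %g\n", out[0], out[5], out[9]);  // expected: 1 3.5 5.5
}
```

Such a loop has to be compiled with AVX support enabled (e.g. -mavx).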
Tomographic Reconstruction with GPU computing
A matrix approach to tomographic reconstruction and its implementation on GPUs.
F Vazquez, EM Garzon, JJ Fernandez.
Journal of Structural Biology 170:146-151, 2010.
Tomographic Reconstruction with MPI
Efficient parallel implementation of iterative reconstruction algorithms for electron tomography.
JJ Fernandez, D Gordon, R Gordon.
Journal of Parallel and Distributed Computing 68(5):626-640, 2008.
HPC in Noise Filtering
TomoEED: fast edge-enhancing denoising of tomographic volumes.
JJ Moreno, A Martinez-Sanchez, JA Martinez, EM Garzon, JJ Fernandez.
Bioinformatics 34:3776-3778, 2018. [Software]
Three-dimensional feature-preserving noise reduction for real-time electron tomography.
JJ Fernandez, JA Martinez.
Digital Signal Processing 20:1162-1172, 2010. [Software]
High performance noise reduction for biomedical multidimensional data.
S Tabik, EM Garzon, I Garcia, JJ Fernandez.
Digital Signal Processing 17:724-736, 2007. [Software]
HPC for 3D reconstruction in Single-Particle Electron Microscopy
Exploiting desktop supercomputing for 3D electron microscopy reconstructions using ART with blobs.
JR Bilbao-Castro, R Marabini, COS Sorzano, I Garcia, JM Carazo, JJ Fernandez.
Journal of Structural Biology 165:19-26, 2009.
Parameter optimization in 3D reconstruction on a large scale grid.
JR Bilbao-Castro, A Merino, I Garcia, JM Carazo, JJ Fernandez.
Parallelization of reconstruction algorithms in three-dimensional electron microscopy.
JR Bilbao-Castro, JM Carazo, I Garcia, JJ Fernandez.
HPC for Sparse Matrix Vector Product
Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach.
F Vazquez, JJ Fernandez, EM Garzon.
Parallel Computing 38:408-420, 2012. [Software]
A new approach for sparse matrix vector product on NVIDIA GPUs.
F Vazquez, JJ Fernandez, EM Garzon.
Concurrency and Computation: Practice and Experience 23:815-826, 2011. [Software]
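The ELLR-T kernel referred to above builds on the ELLPACK-R storage scheme: nonzeros padded to the maximum row length and stored column-major for coalesced access, together with an array holding the actual number of nonzeros per row. The sketch below shows that layout with a plain CPU loop so the data structure is easy to follow; the published kernels compute each row with one or several GPU threads, and all identifiers here are illustrative.

```cpp
// ELLPACK-R storage: nonzeros padded to the maximum row length and stored
// column-major (a_ell), plus rl[] with the true length of each row.
// CPU sketch of the per-row product; the published ELLR-T kernels compute
// each row with one or several GPU threads instead. Names are illustrative.
#include <vector>
#include <cstdio>

struct EllMatrix {
    int n_rows, max_nnz_per_row;
    std::vector<double> a_ell;   // size n_rows * max_nnz_per_row, column-major
    std::vector<int>    j_ell;   // column indices, same layout as a_ell
    std::vector<int>    rl;      // actual number of nonzeros in each row
};

void spmv_ellr(const EllMatrix& A, const std::vector<double>& x,
               std::vector<double>& y) {
    for (int i = 0; i < A.n_rows; ++i) {
        double sum = 0.0;
        // Thanks to rl[i], rows shorter than max_nnz_per_row skip the padding.
        for (int k = 0; k < A.rl[i]; ++k) {
            int idx = k * A.n_rows + i;          // column-major access
            sum += A.a_ell[idx] * x[A.j_ell[idx]];
        }
        y[i] = sum;
    }
}

int main() {
    // Same 3x3 example matrix [[4,1,0],[0,3,0],[2,0,5]], at most 2 nonzeros/row.
    EllMatrix A{3, 2,
                {4, 3, 2,  1, 0, 5},     // column-major values (0 = padding)
                {0, 1, 0,  1, 0, 2},     // column-major column indices
                {2, 1, 2}};
    std::vector<double> x{1, 2, 3}, y(3);
    spmv_ellr(A, x, y);
    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // expected: 6 6 17
}
```

The row-length array avoids the wasted work that plain ELLPACK spends on padding when row lengths are irregular.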