PhD Dissertation

High-Throughput Computation through Efficient Resource Management

Scientific applications run on supercomputers where thousands of nodes are shared among users. Once an application starts, its resources remain allocated until the job ends. We have identified two approaches to resource management that increase global throughput and make better use of the underlying resources.

Ongoing Projects

Dynamic Resource Reallocation for Adaptive Workloads

MPI applications can change their number of processes to better fit a specific computational stage, or upon request from the resource manager. We have developed a communication layer between the manager and the runtime in order to reconfigure the processes of any running application. The work also covers the integration of new programming models into a resource manager capable of reassigning resources to jobs depending on the cluster status.
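As an illustration only, the kind of decision such a manager could take for a malleable job can be sketched in a few lines of Python. The `Job` fields, the `decide_resize` function and the policy itself (shrink when other jobs are queued, otherwise grow into idle nodes, one process per node) are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical view of a malleable job as seen by the manager."""
    name: str
    current: int   # processes currently allocated
    minimum: int   # fewest processes the job can run with
    maximum: int   # most processes the job can exploit


def decide_resize(job: Job, idle_nodes: int, queued_jobs: int) -> int:
    """Return the new process count for `job` (illustrative policy only).

    Assumes one process per node. If other jobs are waiting, the job is
    shrunk to its minimum to free nodes; otherwise it expands into the
    idle nodes, up to its maximum.
    """
    if queued_jobs > 0:
        return job.minimum
    return min(job.maximum, job.current + idle_nodes)
```

For example, under these assumptions a job running on 8 processes with limits 4 to 16 would grow to 12 when 4 nodes are idle and the queue is empty, and shrink to 4 as soon as another job is waiting.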

Past Projects

GSaaS: GPU Scheduling as a Service

Management of cloudified accelerators in cloud infrastructures.

rCUDA - remote CUDA

rCUDA is a virtualization solution that allows GPUs to be shared among the nodes in a cluster. SLURM is a workload manager that schedules jobs and manages resources. In this project I was in charge of integrating both technologies, since rCUDA does not manage workloads and SLURM does not know how to share resources such as GPUs. The rCUDA project now offers this integration as a patch to SLURM.

REALCLOUD - Real Data Center Cloud Services and Environment

This project was carried out by several entities. We were responsible for developing a middleware able to consolidate the system, making decisions based on IT data gathered in real time. To that end, it would migrate virtual machines and power nodes on and off in order to boost performance and reduce the carbon footprint as much as possible.
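A consolidation decision of this kind can be sketched as a bin-packing problem. The Python below is a minimal sketch and not the project's middleware: it assumes each virtual machine has a single integer load, that all nodes share the same capacity, and it uses the first-fit decreasing heuristic, chosen here purely for illustration. The function name `consolidate` is hypothetical.

```python
def consolidate(vm_loads, node_capacity):
    """Pack VM loads onto as few nodes as the heuristic finds.

    Uses first-fit decreasing: place the largest VMs first, each on the
    first node with enough spare capacity. Returns a list of nodes, each
    a list of the VM loads placed on it. Nodes not in the result could
    be powered off.
    """
    if any(load > node_capacity for load in vm_loads):
        raise ValueError("a VM does not fit on any node")
    nodes = []
    for load in sorted(vm_loads, reverse=True):
        for node in nodes:
            if sum(node) + load <= node_capacity:
                node.append(load)
                break
        else:
            # No existing node has room, so another node stays powered on.
            nodes.append([load])
    return nodes
```

With loads `[5, 3, 4, 4]` and a capacity of 8, the sketch places the VMs on two nodes, so any further nodes would be candidates for shutdown. A real system would also weigh migration cost and several resource dimensions, which this sketch ignores.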

MONICA - Monitoring and control system with intelligent energy efficiency management for ICT resources in ultradense data centers oriented to HPC and Cloud Computing

Our research group's contribution to this project focused on the theoretical study of power consumption in the cluster of the FCSCL (Supercomputing Foundation of Castilla y León, Spain).

ACEI - Adjusting the Energy Consumption in Computer Facilities

This project consisted of developing a simulator to assess energy-saving strategies and policies for HPC workloads. The real Energy Saving Cluster (ESC) system, based on Sun Grid Engine (SGE), was modeled in order to simulate its behavior, taking into account the features of the cluster's components, the scheduling policies, and the energy-saving policies. The simulator generated statistics and charts from the results. It was written in Python and had a web user interface for its management.
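To give a flavor of what evaluating an energy-saving policy involves, the following is a minimal sketch and not the ACEI simulator itself. It assumes a single node, a simple policy that powers the node off after a fixed idle timeout, and it ignores boot time and boot energy. The function names and the power figures in the usage note are hypothetical.

```python
def idle_gap_energy(gap, idle_timeout, p_idle, p_off):
    """Energy used during an idle gap of `gap` seconds.

    The node stays idle (powered on) until `idle_timeout` elapses,
    then remains powered off for the rest of the gap.
    """
    on_time = min(gap, idle_timeout)
    return on_time * p_idle + (gap - on_time) * p_off


def node_energy(busy, horizon, idle_timeout, p_busy, p_idle, p_off):
    """Total energy (watts x seconds = joules) of one node.

    `busy` is a sorted list of non-overlapping (start, end) intervals,
    in seconds, within [0, horizon].
    """
    energy = 0
    clock = 0
    for start, end in busy:
        energy += idle_gap_energy(start - clock, idle_timeout, p_idle, p_off)
        energy += (end - start) * p_busy
        clock = end
    # Account for the idle time after the last job.
    energy += idle_gap_energy(horizon - clock, idle_timeout, p_idle, p_off)
    return energy
```

As a usage example, take jobs at `[(0, 10), (50, 60)]` over a 100-second horizon, with assumed draws of 200 W busy, 100 W idle and 10 W off. A 20-second timeout gives 8400 J, while a timeout longer than any gap (the node never powers off) gives 12000 J. Comparing such totals across policies is the kind of statistic a simulator like this would report.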