Research

My current research focuses on runtimes and computer architecture. I am working on the MECCA project, whose goal is to codesign runtimes and architectures to overcome the challenges of parallelism exploitation, energy consumption (power), and predictability (i.e., meeting deadlines) in multicore architectures.

PhD Students

I have been the main supervisor of the following PhD students:

  • Muhammad Shafiq (PhD, 2008-2012): "Architectural Explorations for Streaming Accelerators with Customized Memory Layouts". Currently at the Centre of Excellence in Science & Advanced Technologies (CESAT), Islamabad, Pakistan. Graduated May 2012

I have co-supervised the following PhD students for the listed periods:

  • Branimir Dickov (2008-2015). Thesis title: "MPI Layer Techniques to Improve Network Energy Efficiency". Graduated December 2015
  • Tassadaq Hussain (2009-2011): On the topic of Programmable Memory Controllers. Graduated December 2014
  • Madhavan Manivannan (2014-2016). Thesis title: "Towards Runtime-Assisted Cache Management for Task-Parallel Programs". Graduated June 2016

Research Projects

During my research career, first at the Technical University of Catalonia, then at the Barcelona Supercomputing Center, then at the Tokyo Institute of Technology, and now at Chalmers University of Technology, I have been involved in many research projects that allowed me to participate in several research communities: Embedded Computing, Computer Architecture, Reconfigurable Computing and High Performance Computing. My main interest is the design of computing models and computer architectures for supercomputers and embedded high-performance devices.

Embedded Computing

I started working in Embedded Computing during my Master Thesis in 2002. I developed a power model for VLIW architectures that included wide functional units. A wide functional unit is similar to a SIMD unit, but instruction groups were identified by a compiler approach based on MIRS_C (Modulo Scheduling with Integrated Register Spilling for Clustered Architectures). Besides the thesis itself, this work led to several publications [SAMOS03, ISHPC-V, IJHPCN2004, LNCS-3133, IJES2008]. I also worked on a cryptographic accelerator for the AES algorithm capable of multiplexing multiple AES streams through a single pipeline. The design was first developed around 2003; I finally implemented the engine on a Virtex-II FPGA during my internship at TU Delft in 2007 [Stamatis-2007, VECPAR-2008].
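The benefit of multiplexing streams can be illustrated with a toy throughput model (a sketch only; the pipeline depth and the function below are hypothetical and do not describe the actual Virtex-II engine). Within a chained stream, a block cannot enter the pipeline before the previous block of that stream leaves it, but blocks from independent streams can be interleaved round-robin:

```python
def total_cycles(num_streams, blocks_per_stream, depth=10):
    """Cycles needed to encrypt all blocks when each stream is chained
    (block n+1 may not issue before block n leaves the pipeline), while
    distinct streams are independent and issue round-robin, at most one
    block per cycle (single issue port)."""
    ready = [0] * num_streams                  # earliest issue cycle per stream
    remaining = [blocks_per_stream] * num_streams
    cycle = finish = 0
    while any(remaining):
        for s in range(num_streams):
            if remaining[s] and ready[s] <= cycle:
                remaining[s] -= 1
                ready[s] = cycle + depth       # chaining dependency
                finish = max(finish, cycle + depth)
                break                          # one block per cycle
        cycle += 1
    return finish
```

With a single chained stream the pipeline is mostly empty (one block every `depth` cycles), whereas `depth` interleaved streams keep it full and approach one block per cycle.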

Register File Architecture

Computer Architecture was the main research field of my PhD years. I worked on two projects. First, from 2003 to 2005, I worked on extensions to make the future file microarchitecture more power efficient. I developed techniques centered on the register file. The main contribution was a register file that combined the features of the Future File and a Physical Register File [WCED2004]. This research was then extended with a simple optimization that removes unnecessary writebacks, which are easily detected in the new architecture [PACS2004]. By moving the merged register file to the front-end, I developed an architecture that cleanly decoupled state from execution [ISHPC-VI]. Within this architecture it was easy to add a small extension to the pipeline to process simple integer instructions in order, directly in the "state" part of the pipeline. This led to the Chained In-Order/Out-of-Order DoubleCore Architecture [SBAC-PAD-2005]. This was my final proposal before moving to Kilo-Instruction Processors, the topic of my PhD thesis.

Kilo-Instruction Processors

Kilo-Instruction Processors are processors designed to handle more than one thousand instructions in flight. Such a wide instruction window allows them to hide long-latency events such as main memory accesses. This is particularly beneficial for scientific codes, which often consist of independent loops. The topic of Kilo-Instruction Processors was first researched by my thesis co-advisor Dr. Adrian Cristal. My task was to investigate low-complexity methods to design such processors. I started by analyzing the concept of Execution Locality: the phenomenon that program execution proceeds in fast bursts of "local" instructions separated by cache misses [ISHPC-VI]. My main contribution in the field of Kilo-Instruction Processors was a decoupled design called D-KIP (Decoupled Kilo-Instruction Processor) [HPCA2006], in which a first core (the Cache Processor) processed all low-latency events, including L2 hits, and a second core (the Memory Processor) buffered and processed all the instructions depending on long-latency events, mainly main memory accesses. I then proposed a new version of the D-KIP in which the Memory Processor consisted of an array of simple in-order dual-issue processors. Each of these processors handled a "memory epoch", which was then committed in bulk to the memory state. The most interesting feature, however, was that the Memory Processor's cores could be shared among several front-ends in a multi-threaded environment, which improved the efficiency of the system. This design was called Flexible MultiCore (FMC) [PACT2007]. One final structure that I addressed during my PhD thesis was the Load/Store Queue (LSQ). Designing an LSQ for large-window processors is very challenging due to the complex functionality it needs to support.
Using the concepts developed in the D-KIP and FMC, I proposed a two-level LSQ that could handle hundreds of loads and stores, using traditional structures for local disambiguation and a directory-inspired scheme for global disambiguation [ISCA2008]. These three publications form the bulk of my PhD thesis, which can be accessed here.
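The core D-KIP idea of steering instructions by dependence on long-latency loads can be sketched in a few lines (a toy software model under my own assumptions, not the actual hardware mechanism): an instruction goes to the Memory Processor if it is a missing load or transitively depends on one, and a register written by a local instruction becomes "local" again.

```python
def split_dkip(instrs):
    """Toy classification in the spirit of the D-KIP: defer to the Memory
    Processor any instruction that is a long-latency load or (transitively)
    consumes a deferred register; everything else runs on the Cache
    Processor. `instrs` is a list of (dst_reg_or_None, src_regs, is_miss)."""
    deferred = set()                 # registers whose value is still pending
    cache_proc, mem_proc = [], []
    for i, (dst, srcs, miss) in enumerate(instrs):
        if miss or any(s in deferred for s in srcs):
            mem_proc.append(i)
            if dst is not None:
                deferred.add(dst)    # propagate the dependence
        else:
            cache_proc.append(i)
            if dst is not None:
                deferred.discard(dst)  # redefined locally: no longer pending
    return cache_proc, mem_proc
```

Note how a local redefinition of a register breaks the dependence chain, which is what keeps the deferred slice small in codes with good execution locality.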

Reconfigurable Computing

After my PhD defense I joined the Barcelona Supercomputing Center as a Senior Researcher. After my experience at TU Delft, I decided to work on the topic of Reconfigurable Supercomputing and built a small team of researchers working on FPGA computing topics. Together with my student Dr. Muhammad Shafiq, we worked on seismic imaging [FPT2009, TPDPS2011] and on efficient reconfigurable architectures. On this topic Dr. Shafiq developed a template system for the portable design of efficient reconfigurable accelerators [WRC2011, SASP2011, JSA2013], as well as a GPU-like architecture with a reconfigurable data frontend [CF2012, IBCAST2013]. He obtained his PhD in May 2012 from the Technical University of Catalonia. Together with my student Branimir Dickov, we developed an FPGA implementation of Sparse Matrix-Vector Multiplication capable of interleaving the processing of multiple rows, resulting in a highly adaptive execution with a very simple implementation, enabled by a modified input matrix format [WRC2010]. Branimir has since moved on to work on HPC (see below). Finally, I supervised PhD candidate Tassadaq Hussain on the development of programmable memory controllers [WRC2011, ARC2012]. As part of an internship, Mr. Hussain also developed an implementation of the Reverse Time Migration algorithm using a High Level Synthesis tool called HCE [FPT2011].
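The row-interleaving idea behind the SpMV engine can be sketched in software (an illustrative analogue only; the function and the CSR-like layout below are my assumptions, and the modified matrix format of [WRC2010] differs): nonzeros from several rows are consumed round-robin, so each row's accumulator never has to wait on its own previous addition.

```python
def spmv_interleaved(rows, x, lanes=4):
    """Sparse matrix-vector product y = A*x where A is given as a
    CSR-like list of per-row [(col, val), ...] lists. Nonzeros from up
    to `lanes` rows are processed round-robin, mimicking how the FPGA
    pipeline interleaves rows to hide accumulation latency."""
    y = [0.0] * len(rows)
    for base in range(0, len(rows), lanes):
        # iterators over the nonzeros of the currently active rows
        active = {r: iter(rows[r])
                  for r in range(base, min(base + lanes, len(rows)))}
        while active:
            for r in list(active):           # round-robin over live rows
                try:
                    c, v = next(active[r])
                    y[r] += v * x[c]
                except StopIteration:
                    del active[r]            # row finished, retire it
    return y
```

Because rows retire independently, short and long rows can be mixed freely, which is what makes the execution adaptive.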

High Performance Programming Models

In recent years I have finally settled in the field of High Performance Computing and extreme-scale computing. HPC connects the two fascinating fields of science and computers. In the field of supercomputing I have been experimenting with dataflow programming models applied to N-Body problems such as Barnes-Hut [IAAA01] and FMM [IPSJ-136]. I am particularly interested in researching programming models, runtimes and architectural support for future compute nodes with hundreds of cores. I am also involved in the development of performance analysis techniques for these environments. Most of this work has been conducted since my arrival at the Matsuoka Laboratory at TokyoTech. My main contribution has been a set of tools known as LOI/KRD [PMBS2013, ICS2014], designed to analyze concurrency, runtime overheads and work-time inflation in task-parallel and task-dataflow codes. KRD is a tool for coarse-grained analysis of reuse distance that maps particularly well to task-dataflow parallelism and scales orders of magnitude better than fine-grained approaches. I am also collaborating with several PhD students on the analysis of dataflow versus fork-join models for multicores [ISC2013] and on dataflow scheduling in heterogeneous CPU/GPU environments [IPSJ-136].
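For readers unfamiliar with reuse distance, the classic fine-grained definition is easy to state in code (a textbook sketch of the metric itself, not KRD's implementation; KRD gains its scalability by working on coarse data blocks or task footprints rather than individual addresses):

```python
def reuse_distances(trace):
    """For each access in `trace`, the number of *distinct* addresses
    touched since the previous access to the same address, or inf on
    first touch. This is the classic LRU-stack reuse distance."""
    stack = []                       # distinct addresses, most recent last
    dists = []
    for a in trace:
        if a in stack:
            i = stack.index(a)
            dists.append(len(stack) - 1 - i)  # addresses above `a` in the stack
            stack.pop(i)             # move `a` back to the top
        else:
            dists.append(float("inf"))
        stack.append(a)
    return dists
```

An access hits in a fully associative LRU cache of capacity C exactly when its reuse distance is below C, which is why the histogram of these distances characterizes locality independently of any particular cache.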

Scalable software and parallel architectures

Since 2014 I have been involved in the MECCA and SCHEME projects, both led by Prof. Per Stenström. We are working towards forms of parallelism management that control hardware via runtime support. One direction we have researched is a set of schemes to better manage the hardware caches by exploiting the implicit information on task DAGs available in programming models such as OmpSs or OpenMP 4.x [HPCA2016, CAL2017]. I have also been working on techniques to develop resource-efficient software. This work has resulted in the XiTAO programming model [RESPA2015, PACT2016], a two-level scheduler in which applications are composed of moldable tasks and a runtime elastically assigns portions of the system's resources to them. This combination provides performance portability across heterogeneous architectures. Within the context of XiTAO, we have also studied a novel mechanism for load balancing based on an approximate view of the system.
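The essence of moldability can be sketched as follows (a hypothetical model under an Amdahl-style scaling assumption; the function, parameters and threshold below are illustrative and are not the XiTAO API): each task can run at several widths, and the runtime picks the widest allocation that still uses cores efficiently, rather than blindly maximizing per-task speedup.

```python
def pick_width(work, parallel_fraction, widths=(1, 2, 4, 8), min_eff=0.7):
    """Choose a width (number of cores) for a moldable task.
    Execution time is modeled with Amdahl's law; the widest width whose
    parallel efficiency stays above `min_eff` is selected, so cores are
    not wasted on poorly scaling tasks."""
    def time_at(w):
        return work * ((1 - parallel_fraction) + parallel_fraction / w)
    best = 1
    for w in widths:                       # efficiency decreases with w
        efficiency = time_at(1) / (w * time_at(w))
        if efficiency >= min_eff:
            best = w
    return best
```

A highly parallel task is given many cores while a mostly serial one keeps a single core, leaving the rest of the machine free for concurrent tasks; this per-task elasticity is what yields performance portability across machines with different core counts.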

Besides these projects, I am also involved in the ACE project, which aims to develop efficient computers based on approximate algorithms.