
Research & Education

Research Interest:  Building computational frameworks that combine the best of High-Performance Computing and Big Data technologies

Expertise: Heterogeneous Architectures, Parallel Programming languages, Parallel Runtimes, Compiler Optimizations, Algorithm Design, Big Data Architectures


Rice University, Houston, Texas, USA 
Doctor of Philosophy, Computer Science                                          August 2010 – May 2015

Indian Institute of Technology Kanpur, Uttar Pradesh, India 
M.Tech. Computer Science and Engineering                                     August 2008 – May 2010 

Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India
B. Tech. Information and Communication Technology                      August 2004 – May 2008
  • GPA: 9.15/10.0
  • Advisor: Professor Sunitha V. Murugan

Summary of Research

With Dennard scaling coming to an end, today we see processors with increasingly diverse architectural features.
Heterogeneity is everywhere: in mobile phones, laptops, servers, and even supercomputers. The problem is that existing languages and applications cannot tap the full potential of modern heterogeneous processors.
As part of my research in the Habanero Extreme Scale Research Project led by Prof. Vivek Sarkar, I extended existing languages and applications to take full advantage of modern architectures.

Heterogeneous Architectures
Applications must meet existing standards of Portability, Productivity, and Performance on this newer hardware.

I was also a graduate student member of the Center for Domain-Specific Computing.

Research Projects

Heterogeneous Habanero-C (H2C): A Portable Programming Model for Heterogeneous Architectures 

  • Implemented H2C by adding new language constructs to Habanero-C that are designed to take full advantage of modern heterogeneous architectures.

Compute Constructs
forasync point(args) range(args) at(dev-list) < seq() scratchpad() >
  • forasync is a data+task parallel loop construct; there is no barrier at the end of the loop.
  • The point clause specifies the loop indices.
  • The range clause specifies the loop iteration domain.
  • The at clause specifies the devices where execution takes place.
  • The optional seq clause specifies the work-item sizes.
  • The optional scratchpad clause promotes a data region to take advantage of scratchpad buffers such as local shared memory.

Data Constructs
async copyin(args) copyout(args) at(dev-list)
  • async spawns an asynchronous task to copy data to (copyin) and from (copyout) the devices (dev-list) specified in the at clause.

Synchronization Constructs
finish {}; await(args); phaser-next;
  • finish ensures that all tasks launched asynchronously inside its scope have completed.
  • The await construct waits until all its dependencies are satisfied.
  • phaser-next provides fine-grain synchronization support.
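The async/finish semantics above can be approximated in plain C++; the names finish_scope and forasync_range below are our illustrative stand-ins, not H2C syntax, and device placement (the at clause) is elided:

```cpp
#include <atomic>
#include <future>
#include <utility>
#include <vector>

// Illustrative C++ analogue of H2C's async/finish semantics: a
// finish_scope joins every task spawned inside it on destruction,
// mirroring how finish waits for its asynchronous children.
struct finish_scope {
    std::vector<std::future<void>> tasks;
    template <class F>
    void async(F&& f) {                 // spawn a child task
        tasks.push_back(std::async(std::launch::async, std::forward<F>(f)));
    }
    ~finish_scope() {                   // implicit join at end of scope
        for (auto& t : tasks) t.wait();
    }
};

// forasync-like helper: run body(i) for i in [lo, hi) as tasks inside
// the given finish scope; like forasync, it adds no barrier of its own.
template <class F>
void forasync_range(finish_scope& fs, int lo, int hi, F body) {
    for (int i = lo; i < hi; ++i) fs.async([=] { body(i); });
}

int run_demo() {
    std::atomic<int> sum{0};
    {
        finish_scope fs;
        forasync_range(fs, 0, 8, [&](int i) { sum += i; });
    }   // finish: all eight tasks have completed here
    return sum.load();
}
```

The enclosing scope plays the role of finish: the tasks may finish in any order, but the sum is complete once the scope exits.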

H2C Framework
Overall H2C Framework for Heterogeneous Architectures 

 Highlights of H2C

  • Meta Data Layout Framework: Compiler automatically generates code with the data layout specified in the meta file.
  • Task Partitioning and Data Distributions: The programmer specifies the partitioning of tasks and the compiler automatically determines the data distributions.
  • Hybrid Scheduling: High level constructs allow the user to specify the partition of the iteration space on the CPU and GPU.
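The hybrid-scheduling idea can be reduced to a small sketch (our simplification, not H2C's compiler output): the user supplies the fraction of a 1-D iteration space to run on the CPU, and for an identity access pattern the data distribution follows directly from the iteration split.

```cpp
#include <utility>

// Illustrative iteration-space split between CPU and GPU. Each half-open
// range says which iterations (and, for identity accesses, which data
// elements) a device owns.
struct split {
    std::pair<int, int> cpu;
    std::pair<int, int> gpu;
};

split partition(int n, double cpu_fraction) {
    int cut = static_cast<int>(n * cpu_fraction);
    return { {0, cut}, {cut, n} };   // [0,cut) on CPU, [cut,n) on GPU
}
```

For non-identity access patterns the compiler must additionally map the iteration sub-ranges through the subscript functions to derive the data sub-ranges, which is the harder part H2C automates.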

Concord Programming Model

  • Concord is an Intel Threading Building Blocks (TBB) extension that currently targets integrated CPU+GPU processors that do not share virtual address spaces.
  • Non-shared virtual address spaces between processors restrict programmability: one cannot share pointers between the devices.
  • To overcome this limitation, we implemented a Shared Virtual Memory (SVM) layer in the Concord programming system.
  • We also support many C++ features; recursion, system calls, and exceptions are among the few features not supported.
  • Concord is now an open-source project available at iHRC and is in use by product groups at Intel and research groups in academia, including UT Austin, UCSD, and Saarland University.
Shared Virtual Memory

Shared Virtual Memory Implementation in Concord
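One way an SVM layer can let CPU and GPU exchange pointers despite separate virtual address spaces is to allocate mirrored heaps and re-base a pointer by its offset. This is an illustrative sketch of that idea, not Concord's actual implementation:

```cpp
#include <cstdint>

// Toy SVM translation: CPU and GPU heaps are mirrored regions, so a
// pointer is logically an offset and translation is a base swap.
struct svm_region {
    std::uintptr_t cpu_base, gpu_base;

    std::uintptr_t to_gpu(const void* cpu_ptr) const {
        std::uintptr_t off =
            reinterpret_cast<std::uintptr_t>(cpu_ptr) - cpu_base;
        return gpu_base + off;          // same offset, different base
    }
    std::uintptr_t to_cpu(std::uintptr_t gpu_ptr) const {
        return cpu_base + (gpu_ptr - gpu_base);
    }
};
```

Because the offset is preserved, a linked structure built on the CPU heap remains traversable on the device after a uniform translation of its pointers.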

Lattice Boltzmann Method Simulation (LBM): A Case Study

  • The goal of this project was to parallelize a sequential implementation of a Lattice Boltzmann Method simulation from Halliburton Services.
  • We successfully reduced the execution time of the application from 4 days on a single CPU to just 4 minutes on a single GPU node.
  • This project highlighted many practical portability and productivity problems in parallelizing real applications; it is important for modern programming frameworks to handle these issues.
LBM Operator

Fluid simulation using the LBM operator
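The parallelization pattern applied to the LBM kernel, reduced to a minimal hypothetical stand-in (not Halliburton's code): within one time step every lattice site depends only on the previous state, so the site range can be split across threads with no intra-step synchronization.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Minimal stencil step standing in for one LBM update: each interior
// site is relaxed toward the average of its neighbours, reading only
// the previous state `in`, so row chunks are independent.
std::vector<double> relax_step(const std::vector<double>& in, int nthreads) {
    int n = static_cast<int>(in.size());
    std::vector<double> out(in);        // boundaries keep their old values
    auto work = [&](int lo, int hi) {
        for (int i = std::max(lo, 1); i < std::min(hi, n - 1); ++i)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    };
    std::vector<std::thread> pool;
    int chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back(work, t * chunk, std::min(n, (t + 1) * chunk));
    for (auto& th : pool) th.join();    // barrier between time steps
    return pool.empty() ? in : out;
}
```

The real kernel is a 3-D, multi-distribution update, but the double-buffered read-old/write-new structure that makes GPU parallelization safe is the same.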

Work Stealing on TI Keystone-II (CPU + DSP cores)

  • In this project we built a work-stealing runtime across ARM + DSP cores. As shown in the figure below, the Keystone-II has 4 ARM + 8 DSP cores.
  • There is a shared memory (MSMC) between the ARM and DSP cores that can be used as a software-managed cache (similar to GPU local memory).
  • The presence of hardware queues makes it interesting to experiment with work-stealing techniques across the ARM and DSP cores.

TI-Keystone II
TI Keystone-II(Hawking) Architecture
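The core data structure of such a runtime is the work-stealing deque. A minimal sketch (mutex-based for clarity; the actual ARM+DSP runtime would exploit the Keystone-II hardware queues or a lock-free deque):

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Work-stealing deque sketch: the owning worker pushes and pops at the
// bottom (LIFO, good locality), while thieves steal the oldest task
// from the top (FIFO, large chunks of work).
class ws_deque {
    std::deque<int> tasks;              // int stands in for a task handle
    std::mutex m;
public:
    void push(int t) {
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(t);
    }
    std::optional<int> pop() {          // owner end
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.back(); tasks.pop_back(); return t;
    }
    std::optional<int> steal() {        // thief end
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.front(); tasks.pop_front(); return t;
    }
};
```

Taking from opposite ends keeps owner and thief contention low, which matters even more when the thief is a DSP core reaching across MSMC.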

Asynchronous task creation and scheduling on the SCC processor

  • We built a runtime for Habanero-C on the Intel Single-chip Cloud Computer (SCC).
  • We support the async, flat finish, and hierarchical place constructs.

Hierarchical Place Tree mapping on the SCC processor
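A Hierarchical Place Tree can be sketched as follows (illustrative, not the SCC runtime itself): each place holds a task queue and a parent link, and a worker bound to a leaf place takes local tasks first, then walks up through ancestor places, which favors locality.

```cpp
#include <deque>
#include <optional>
#include <vector>

// One place in the tree: a task queue plus the index of its parent
// place (-1 at the root). Tasks are ints standing in for task handles.
struct place {
    int parent = -1;
    std::deque<int> q;
};

// A worker at `leaf` searches its own place, then each ancestor in
// turn, mirroring HPT scheduling: nearby work before distant work.
std::optional<int> next_task(std::vector<place>& hpt, int leaf) {
    for (int p = leaf; p != -1; p = hpt[p].parent) {
        if (!hpt[p].q.empty()) {
            int t = hpt[p].q.front();
            hpt[p].q.pop_front();
            return t;
        }
    }
    return std::nullopt;    // no task anywhere on the path to the root
}
```

On the SCC, mapping places to cores and tiles lets tasks pushed at an inner place be shared only by the workers under that subtree.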

COHX: Chapel On HSA + XTQ

  • In this project, we extend Chapel to take advantage of AMD's Heterogeneous System Architecture (HSA) and eXtended Task Queueing (XTQ).

Task Graphs in Habanero-C

  • We developed a Task Graph generation module for the Habanero-C compiler infrastructure.
  • We modified the Habanero-C compiler to insert function calls before/after every async/finish construct in a Habanero-C program. The Task Graph is generated with the help of these functions: a single-threaded execution of the Habanero-C program produces the Task Graph.
  • The user can also use these function calls to analyze and debug Habanero-C programs.
HC Task Graphs

Task Graph For Fib(3)
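The instrumentation idea can be sketched with hypothetical hooks (not the actual Habanero-C compiler API): a call inserted before each async records a spawn edge from the current task to the new one, so one serial run of the program yields the graph.

```cpp
#include <utility>
#include <vector>

// Recorder standing in for the compiler-inserted callbacks: on_async is
// the hook the compiler would call just before an async body runs.
struct task_graph {
    int next_id = 1;                            // task 0 is the main task
    std::vector<std::pair<int, int>> edges;     // (parent, child) spawn edges

    int on_async(int parent) {
        int child = next_id++;
        edges.push_back({parent, child});
        return child;
    }
};

// fib written async-style: every recursive call is modeled as a spawned
// task, executed serially while the graph is recorded.
int fib(int n, task_graph& g, int self) {
    if (n < 2) return n;
    int a = fib(n - 1, g, g.on_async(self));
    int b = fib(n - 2, g, g.on_async(self));
    return a + b;
}
```

Running fib(3) as the main task records four spawn edges: fib(3) spawns fib(2) and fib(1), and fib(2) spawns fib(1) and fib(0), matching the shape of the graph in the figure.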

Implementing OpenMP-4.0 constructs

  • To be updated.

Speculative Loop Parallelization

  • Multicores are now ubiquitous, yet many applications remain sequential and do not utilize the cores efficiently.
  • Speculative parallelization is a promising technique for exposing parallelism in loops with irregular, possibly input-dependent data dependencies that are impossible to resolve statically.
  • As part of my master's thesis, under the guidance of Prof. Sanjeev Kumar Aggarwal, I designed an efficient lock-free framework for speculative loop parallelization.
  • Lock-freedom ensures scalability and minimizes run-time scheduling overhead. The framework handles speculation with minimal memory overhead and supports pointer-intensive loops.
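The validation logic behind speculation can be modeled sequentially (a toy illustration of the general technique, not the lock-free framework from the thesis): each iteration runs on a snapshot, logging its reads and buffering its writes; at in-order commit time, an iteration whose read set overlaps an earlier iteration's write set has read stale data, so it is squashed and re-executed on committed state.

```cpp
#include <set>
#include <utility>
#include <vector>

// Per-iteration speculation record: what it read, what it wrote, and
// the writes buffered privately instead of applied to shared data.
struct spec_iter {
    std::set<int> reads, writes;
    std::vector<std::pair<int, int>> buffered;   // (index, value)
};

// Commit iterations in loop order. `body(i, data)` re-runs iteration i
// directly on the committed data when its speculation failed.
template <class Body>
int commit_all(std::vector<spec_iter>& iters,
               std::vector<int>& data, Body body) {
    int squashed = 0;
    std::set<int> committed_writes;              // indices written so far
    for (std::size_t i = 0; i < iters.size(); ++i) {
        bool conflict = false;
        for (int r : iters[i].reads)
            if (committed_writes.count(r)) { conflict = true; break; }
        if (conflict) {                          // stale read: squash, redo
            ++squashed;
            body(static_cast<int>(i), data);
        } else {                                 // speculation held: commit
            for (auto& w : iters[i].buffered) data[w.first] = w.second;
        }
        for (int w : iters[i].writes) committed_writes.insert(w);
    }
    return squashed;
}
```

For example, for a loop body data[i+1] = data[i] + 1 starting from {1, 0, 0}, iteration 1 speculatively reads the stale data[1], gets squashed, and is re-executed after iteration 0 commits.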
