Research Projects

Current Research

Modelling memory access patterns to predict memory traffic between the LLC and DRAM

To characterize whether a kernel is memory- or compute-intensive, the operational intensity metric from the Roofline model is widely used. Determining a kernel's operational intensity requires the DRAM read/write byte count between the LLC and DRAM. However, the DRAM byte count can only be measured after running the kernel, and it varies across processors with different memory hierarchies. Because of this dynamic nature, DRAM traffic is hard to model without profiling, and it also depends on the kernel's memory access pattern. So far we have relied on profiling data to obtain the DRAM read/write counts, which is a bottleneck for integrating our method into a runtime system. To predict DRAM read/write traffic instead, we need an application model that captures the memory access pattern and a machine model that captures the cache hierarchy. In the end, COMPASS (Figure 1) annotates the kernels with the predicted information, which the runtime uses when making scheduling decisions.
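The two quantities involved can be made concrete with a small sketch. The FLOP count, byte count, and machine numbers below are placeholder values, not measurements from our system; in practice the DRAM byte count is what profiling provides and what our model aims to predict.

```python
# Hypothetical sketch of operational intensity (OI) from the Roofline model.
# All numbers are illustrative placeholders.

def operational_intensity(flops: float, dram_bytes: float) -> float:
    """OI = floating-point operations / bytes moved between LLC and DRAM."""
    return flops / dram_bytes

def is_memory_bound(oi: float, peak_flops: float, peak_bw: float) -> bool:
    """Under the Roofline model, a kernel is memory-bound when its OI falls
    below the machine balance point (peak FLOP/s over peak DRAM bandwidth)."""
    return oi < peak_flops / peak_bw

# Example: a kernel doing 2e9 FLOPs while moving 8e8 bytes to/from DRAM.
oi = operational_intensity(2e9, 8e8)       # 2.5 FLOPs/byte
# Machine with 1 TFLOP/s peak and 100 GB/s DRAM bandwidth -> balance = 10.
print(is_memory_bound(oi, 1e12, 1e11))     # True: 2.5 < 10
```

The whole difficulty described above lives in the `dram_bytes` argument: it is only known after a profiled run, which is why an application model of the access pattern plus a machine model of the cache hierarchy are needed to predict it.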


Resource contention-aware scheduling for extreme heterogeneous systems: at ORNL (Spring 2019 - Present)

When kernels co-run on different processors of a heterogeneous system, all of which interact with system memory, they contend with one another. Because of this contention, we observe a slowdown in execution time and a change in power consumption. In our studies, we characterized this memory contention in terms of energy and performance and built an empirical model that anticipates its impact, so that a desired balance between energy and performance can be achieved. The main idea is to use operational intensity to quantify the compute-memory intensiveness of kernels when they are collocated, i.e., run together on different processors; based on the operational intensities, the empirical model anticipates the impact of memory contention. We showed that our model and algorithms predict memory contention with high accuracy and can provide a reasonable energy-performance balance. Within the specified timeline, the main objective is to build a memory-contention model for heterogeneous systems that extends the empirical model to cover corner cases. On top of this model, scheduling strategies will be built into a heterogeneous runtime system. To accomplish this, the COMPASS framework (a combination of OpenARC and the ASPEN performance modeling language) will be used to implement and automate the empirical model. The COMPASS framework will then be interfaced with the BRISBANE heterogeneous runtime system to implement intelligent scheduling policies. Finally, a feedback mechanism based on an ML framework will be implemented to refine scheduling decisions. Once the full workflow is in place, other use cases, e.g., unified memory access and intelligent DVFS, will be explored to improve the model.
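The shape of such an empirical model can be illustrated with a toy sketch. This is not the actual ORNL model: the classification threshold and the slowdown table are made-up placeholders standing in for factors that would be fitted from co-run measurements.

```python
# Illustrative sketch (not the actual model): predict the slowdown of a
# kernel when collocated with another, keyed on both kernels' operational
# intensities.  Threshold and table values are hypothetical placeholders.

def intensity_class(oi: float, balance: float = 10.0) -> str:
    """Classify a kernel as memory- or compute-intensive relative to a
    machine balance point (hypothetical threshold)."""
    return "mem" if oi < balance else "comp"

# Fitted slowdown factors per pair of classes (placeholder numbers):
# two memory-intensive kernels contend heavily; compute pairs barely do.
SLOWDOWN = {
    ("mem", "mem"): 1.8,
    ("mem", "comp"): 1.2,
    ("comp", "mem"): 1.1,
    ("comp", "comp"): 1.05,
}

def predict_corun_time(solo_time: float, oi_self: float, oi_other: float) -> float:
    """Predicted execution time of a kernel when collocated with another."""
    key = (intensity_class(oi_self), intensity_class(oi_other))
    return solo_time * SLOWDOWN[key]

print(predict_corun_time(10.0, 2.5, 4.0))  # both memory-bound -> 18.0
```

A scheduler can then compare the predicted co-run times of candidate pairings and pick the collocation with the least contention, which is the role COMPASS-annotated kernels play in the runtime.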


Phylanx: Auto-tuning of the HPX runtime using APEX and TAU (Fall 2017 - Winter 2019)

APEX triggers auto-tuning policies that tune the HPX runtime system's parameters to improve performance.

Parcel Coalescing: For adaptive auto-tuning of HPX using the APEX measurement and instrumentation library, we developed a toy application that sends messages to every other node. I tested this application on Talapas (the University of Oregon cluster) using two compute nodes (dual-socket Intel Xeon processors). With this proof of concept from Talapas in hand, I am now working to observe the impact of the parcel-coalescing policy for a real physics-simulation application at scale. For these experiments I am using NERSC's Cray supercomputer, Cori; I am currently using Intel Xeon "Haswell" nodes and plan to use Intel Xeon Phi "Knights Landing" nodes as well.

Adaptive Task Inlining for HPX: We worked on adaptive task inlining for the HPX runtime system to achieve a balance between parallelism and task overhead. We used multicore Xeon, Xeon Phi, and AMD processors for this study. Our findings were accepted at ICPP 2019.
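The parallelism-versus-overhead trade-off can be sketched as follows. This is a hypothetical illustration of the idea, not HPX internals: the overhead constant and the moving-average weight are placeholders.

```python
# Hypothetical sketch of adaptive task inlining: if a task's expected
# duration is below the runtime's task-management overhead, execute it
# inline in the current thread instead of spawning a runtime task.
# Overhead constant and EMA weight are placeholders, not HPX values.

TASK_OVERHEAD_US = 5.0   # assumed cost (us) of creating/scheduling a task

def should_inline(expected_task_us: float,
                  overhead_us: float = TASK_OVERHEAD_US) -> bool:
    """Inline when the task is cheaper than the overhead of spawning it."""
    return expected_task_us < overhead_us

def update_estimate(prev_us: float, measured_us: float,
                    alpha: float = 0.1) -> float:
    """Exponential moving average keeps the per-task-type duration
    estimate adaptive as measured task times drift."""
    return (1.0 - alpha) * prev_us + alpha * measured_us

print(should_inline(2.0))           # True: a 2 us task is inlined
print(should_inline(50.0))          # False: handed to the scheduler
print(update_estimate(10.0, 20.0))  # 11.0
```

Inlining small tasks trims scheduling overhead, while still spawning large tasks preserves parallelism; the adaptive estimate is what lets the runtime sit near the right balance point.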


Simulation in High Performance Computing environments (2016-2017)

StingRay HPC Project: This project is a joint venture between the Computer Science and Earth Science departments at UO that focuses on seismic ray tracing. The StingRay seismic ray tracer was originally based on Dijkstra's single-source shortest-path algorithm; unfortunately, that algorithm's inherently sequential nature limits its scalability. Our new StingRay-HPC implementation is based on the Bellman-Ford algorithm and demonstrates scalable performance in both execution time and problem size. Our tool decomposes a 3-D model in the X, Y, and Z directions based on the number of nodes used and runs the individual parts on different nodes. Each node communicates with its neighbors and updates ghost cells to form the global model. We have developed both MPI + OpenMP and MPI + CUDA versions of the tool. Experimental results from the XSEDE machine Stampede and the Oak Ridge machine Titan show high scalability and speedup.
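The reason Bellman-Ford parallelizes where Dijkstra does not is visible in its sweep structure: each sweep relaxes every edge independently, so edges (or subdomains) can be processed on different nodes with ghost cells exchanged between sweeps. The serial sketch below shows that structure on a toy graph, not a seismic model.

```python
# Serial Bellman-Ford, written to expose the sweep structure that
# StingRay-HPC distributes across MPI ranks.  Toy graph, not seismic data.
import math

def bellman_ford(n, edges, source):
    """edges: list of (u, v, w); returns shortest distances from source."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(n - 1):              # at most n-1 sweeps to converge
        changed = False
        for u, v, w in edges:           # each relaxation is independent,
            if dist[u] + w < dist[v]:   # so this loop can be split across
                dist[v] = dist[u] + w   # subdomains, with boundary (ghost)
                changed = True          # cells exchanged after each sweep
        if not changed:                 # early exit once globally stable
            break
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
print(bellman_ford(4, edges, 0))        # [0.0, 3.0, 1.0, 4.0]
```

In the distributed version, the "globally stable" check becomes a reduction over all ranks: iteration stops only when no rank has updated any cell in the last sweep.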

Beetle Project: This project is a joint effort of the Computer Science and Geography departments at UO. The idea is to simulate and study the phenomenon of beetles attacking pine trees in the Northwestern US. We already have an implementation of the beetle model using the agent-based modeling tool Repast Simphony; now I am working with a team to make this model run on large machines using Repast HPC, an agent-based modeling framework that provides the modeling features and facilitates the necessary inter-node communication. In our simulation, each forest cell is an agent, and we divide the whole forest into smaller slices that run on individual nodes. Within every node, the forest has independent agents that can communicate with each other to exchange beetles. Trees die based on time, weather, temperature, and beetle density. The main problem is synchronizing these changes among different nodes; MPI is used to maintain the communication between processes.
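A toy illustration of the cell-agent dynamics (not the Repast model): a 1-D row of forest cells where beetle density diffuses to neighbors each step, and a tree dies once the local density crosses a threshold. The spread rate and threshold are made-up parameters; in the HPC version each rank owns a slice of cells and exchanges boundary (ghost) cells via MPI before every step.

```python
# Toy beetle-spread step over a 1-D slice of forest cells.
# SPREAD and DEATH_THRESHOLD are hypothetical parameters.

DEATH_THRESHOLD = 0.8
SPREAD = 0.25          # fraction of density passed to each neighbor

def step(density, alive):
    """One simulation step over the local forest slice."""
    n = len(density)
    new = density[:]
    for i in range(n):
        for j in (i - 1, i + 1):       # neighbor exchange; at a slice
            if 0 <= j < n:             # boundary this would read a ghost
                new[j] += SPREAD * density[i]  # cell from the next rank
    alive = [a and d < DEATH_THRESHOLD for a, d in zip(alive, new)]
    return new, alive

density = [1.0, 0.0, 0.0, 0.0]
alive = [False, True, True, True]      # cell 0 already infested and dead
density, alive = step(density, alive)
print(density)   # [1.0, 0.25, 0.0, 0.0]
print(alive)     # [False, True, True, True]
```

The synchronization problem named above shows up at the slice boundaries: a cell's update depends on its neighbor's density, so ranks must agree on ghost-cell values before each step, which is exactly what the MPI exchange provides.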

TAU and PIN (2016)

Dynamic Binary Instrumentation: We used Intel Pin to insert TAU hooks into binaries at runtime, enabling performance analysis with TAU's features.


Energy-aware Cloud Computing (2014-2015)

To meet the increasing demand for computational power, IT service providers today turn to cloud-based services for their flexibility, reliability, and scalability, and more and more datacenters are being built to cater to customers' needs. However, datacenters consume large amounts of energy, which draws negative attention. To address this, researchers propose energy-efficient algorithms that minimize energy consumption while keeping quality of service (QoS) at a satisfactory level. Virtual machine (VM) consolidation is one such technique for ensuring an energy-QoS balance. In this research, we explore fuzzy-logic and heuristic-based VM consolidation approaches to achieve that balance. We propose a fuzzy VM selection method that selects a VM to migrate away from an overloaded host, and we incorporate migration control into the method to enhance the selection strategy's performance. We also propose a new overload-detection algorithm based on the mean, median, and standard deviation of the VMs' utilization. We used the CloudSim toolkit to simulate our experiments and evaluate the proposed algorithms on real-world workload traces of PlanetLab VMs. Simulation results demonstrate that the proposed method is more energy efficient than competing approaches.
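The flavor of a statistical overload detector can be sketched as below. This is an illustration in the spirit of the approach, not the published formulation: the combination rule, the safety factor, and the threshold are hypothetical.

```python
# Illustrative overload detector: flag a host as overloaded when its
# recent CPU utilization, summarized by mean, median, and standard
# deviation, exceeds an adaptive threshold.  The rule and the `safety`
# and `threshold` parameters are hypothetical, not the published ones.
import statistics

def is_overloaded(utilization_history, safety=1.0, threshold=0.9):
    """utilization_history: recent per-interval CPU utilizations in [0, 1].
    Combines a robust load estimate with its variability, so a host with
    volatile load is flagged earlier than a steady one."""
    mean = statistics.mean(utilization_history)
    median = statistics.median(utilization_history)
    stdev = statistics.pstdev(utilization_history)
    typical = (mean + median) / 2.0      # robust estimate of current load
    return typical + safety * stdev >= threshold

print(is_overloaded([0.92, 0.95, 0.97, 0.93]))  # True: host near capacity
print(is_overloaded([0.40, 0.45, 0.50, 0.42]))  # False: plenty of headroom
```

Once a host is flagged, the fuzzy VM selection step chooses which VM to migrate off it; detection and selection together drive the consolidation loop evaluated in CloudSim.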