I am no longer updating this page. Please have a look at my new webpage.
Greetings! My name is Suraj Kumar. I am a postdoctoral researcher in the Alpines team at Inria Paris. Previously, I was a postdoctoral researcher at Pacific Northwest National Laboratory. I completed my PhD in 2017 at Inria Bordeaux under the supervision of Olivier Beaumont, Emmanuel Agullo, Lionel Eyraud-Dubois, and Samuel Thibault. [ PhD Thesis ]
Prior to joining the PhD program, I worked for 17 months at IBM India Research Lab in the High Performance Analytics group.
I am interested in all aspects of High Performance Computing (HPC). Presently, my research focuses on designing parallel and scalable algorithms for tensor computations, together with their theoretical analysis (in terms of computation and communication costs) and their implementation on state-of-the-art HPC systems. In the past, I have worked on scheduling, the design of new runtime systems, and performance analysis & optimization for heterogeneous systems. Most of this work involved proving bounds for scheduling algorithms or extracting maximum performance from the systems.
Feel free to check out my work, and if you have any questions or would like to collaborate, don't hesitate to contact me.
Contact Information
Alpines team, Inria Paris
2 Rue Simone IFF, 75012 Paris
Email: suraj.kumar@inria.fr
Phone: +33 (0) 782966289
Education
Ph.D., Storm team, Inria Bordeaux, France (Dec 2013 -- May 2017)
Master's, Computer Science and Engineering, Indian Institute of Science, Bangalore, India, Jul 2012
B.Tech, Computer Science and Engineering, Sikkim Manipal Institute of Technology, Sikkim, India, Jul 2010
CV [ Resume ]
Publications
IEEE International Parallel & Distributed Processing Symposium (IPDPS 2020), May 2020, New Orleans (Virtual), Louisiana, USA.
International Conference on Parallel Processing (ICPP 2019), August 2019, Kyoto, Japan.
Concurrency and Computation: Practice and Experience, Wiley, 2018, 30(17).
IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017), May 2017, Orlando, Florida, USA.
Olivier Beaumont, Terry Cojean, Lionel Eyraud-Dubois, Abdou Guermouche, Suraj Kumar. Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources
International Conference on High Performance Computing, Data, and Analytics (HiPC 2016), Dec 2016, Hyderabad, India.
Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, Suraj Kumar. Are Static Schedules so Bad? A Case Study on Cholesky Factorization
IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016), May 2016, Chicago, IL, United States. IEEE, 2016.
Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Suraj Kumar, et al. Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms
Heterogeneity in Computing Workshop 2015, May 2015, Hyderabad, India. 2015.
A. Narang, S. Kumar, J. Soman, M. Perrone, D. Wade, K. Bendiksen, V. Slatten, T. E. Rabben. Maximizing TTI RTM Throughput for CPU+GPU
75th EAGE Conference & Exhibition incorporating SPE EUROPEC 2013
Ankur Narang, Suraj Kumar, Ananda S Das, Michael Perrone, David Wade, Kristian Bendiksen, V Slatten, Tor Erik Rabben. Performance Optimizations for TTI RTM on GPU based Hybrid Architectures
Biennial International Conference & Exposition, 2013.
M. Anandhavalli, Suraj Kumar Sudhanshu, Ayush Kumar, M. K. Ghose. Optimized Association Rule Mining Using Genetic Algorithm
Advances in information mining, ISSN 9753265, Volume 1, Issue 2, 2009.
Posters
Scheduling of Cholesky Factorization with Lookahead Information
Suraj Kumar, HiPC 2016.
Scheduling Strategies and Bounds for Cholesky Factorization on Heterogeneous Platforms
Suraj Kumar, SC 2016.
Scheduling of Task-Based Linear Algebra Kernels on Heterogeneous Resources
Suraj Kumar, IPDPS 2016 PhD Forum.
Recent Talks
Performance Models for Data Transfers: A Case Study with Molecular Chemistry Kernels at ICPP 2019, Kyoto, Japan
Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs at IPDPS 2017, Orlando, Florida, USA
Scheduling of Dense Linear Algebra Kernels on Heterogeneous Resources, PhD defense, 12 April 2017, Inria Bordeaux, France
Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources at HiPC 2016, Hyderabad, India
Are Static Schedules so Bad? A Case Study on Cholesky Factorization at IPDPS 2016, Chicago, Illinois, USA
Are Static Schedules so Bad? A Case Study on Cholesky Factorization at the STORM/TADaaM/HiePACS team meeting on 18 March 2016
Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms at the Algorithmique Distribuée Working Group on 21 September 2015
Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms at HCW 2015, in conjunction with IEEE IPDPS 2015, Hyderabad, India
Professional Services
External reviewer for the 48th International Conference on Parallel Processing (ICPP 2019).
Reviewer for the following international journals: IJPP (since March 2018), CALC (since March 2020), SIMAX (since May 2020), and TOMS (since May 2020).
Key Contributions and Software Tools
Here is a summary of some of my recent and/or relevant contributions.
Parallel algorithms for tensor approximation: I focused on the parallelization of tensor train decomposition and approximation algorithms. The tensor train format represents a high-dimensional tensor with a collection of lower-dimensional tensors. This format is extensively used in molecular and quantum simulations to obtain qualitative properties of quantum systems. I proposed a parallel algorithm to compute the tensor train decomposition of a tensor and proved that the ranks of the tensor train format obtained by my algorithm are bounded by the ranks of the unfolding matrices of the tensor. I also proposed a parallel algorithm to compute an approximation of a tensor in tensor train format. I am now implementing the approximation algorithm for distributed-memory parallel systems.
This is an ongoing work with Laura Grigori at Inria Paris.
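As background, the classical sequential TT-SVD procedure can be sketched in a few lines of NumPy. This is an illustrative baseline, not my parallel algorithm; the truncation criterion and core layout below are just one common choice. It compresses a d-way array into d three-way cores via successive truncated SVDs:

```python
import numpy as np

def tt_svd(tensor, tol=1e-10):
    """Illustrative sequential TT-SVD: decompose a d-way array into
    tensor-train cores of shape (r_{k-1}, n_k, r_k)."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    r_prev = 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = max(1, int(np.sum(S > tol * S[0])))   # truncation rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        # carry the remainder to the next unfolding
        mat = (np.diag(S[:r]) @ Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores
```

Contracting the cores in order (last index of one against the first index of the next) recovers the original tensor up to the chosen truncation tolerance.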
Strategies to maximize communication-computation overlap: I considered the problem of determining the order of data transfers between two memory nodes for a set of independent tasks, with the objective of maximizing communication-computation overlap. I proposed an optimal algorithm to determine this order when there is no memory capacity restriction, and proved that the problem becomes NP-complete under limited memory capacity. I also proposed several heuristics to determine this order and assessed them on two molecular chemistry applications, Hartree-Fock and Coupled Cluster Singles and Doubles.
This work has been conducted during my postdoc at Pacific Northwest National Laboratory with Sriram Krishnamoorthy and Lionel Eyraud-Dubois.
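To give the flavor of the unconstrained version of this problem: with one transfer link feeding one compute resource, it resembles the classical two-machine flow shop, for which Johnson's rule gives an optimal sequence. The sketch below illustrates that analogy only; it is not the algorithm or the model from the paper:

```python
def makespan(order, transfer, compute):
    """Makespan when transfers are serialized on one link and a task
    may start computing once its data has arrived and the device is free."""
    link_free = 0.0
    dev_free = 0.0
    for t in order:
        link_free += transfer[t]                 # transfer completes
        dev_free = max(dev_free, link_free) + compute[t]
    return dev_free

def johnson_order(transfer, compute):
    """Johnson's rule for the two-machine flow shop: schedule tasks with
    transfer <= compute first (increasing transfer), then the rest
    (decreasing compute)."""
    n = len(transfer)
    first = sorted((t for t in range(n) if transfer[t] <= compute[t]),
                   key=lambda t: transfer[t])
    second = sorted((t for t in range(n) if transfer[t] > compute[t]),
                    key=lambda t: -compute[t])
    return first + second
```

Intuitively, fetching data for short-transfer, long-compute tasks first keeps the device busy while later transfers proceed in the background.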
A lightweight runtime system for molecular simulations: I designed and developed a task-based runtime system that performs well on modern HPC systems for the tensor operations of molecular chemistry simulations. This required addressing several challenges: expressing different types of tasks in a unified framework in modern C++, specifying access modes and permissions for data blocks, issuing prefetches in the Global Arrays framework, overlapping communication with computation, and handling heterogeneous resources. This work was part of the NWChemEx Exascale Computing Project.
This work has been conducted during my postdoc at Pacific Northwest National Laboratory with Sriram Krishnamoorthy and Marcin Zalewski.
Scheduling of dense linear algebra kernels on heterogeneous architectures: Recently, many dynamic runtime systems have been proposed to schedule task graphs on platforms consisting of highly heterogeneous resources. The real challenge is to design efficient schedulers for such runtimes that make effective use of all resources. I considered the problem of scheduling dense linear algebra applications on fully heterogeneous platforms made of CPUs and GPUs.
Performance bounds of task graphs: The peak theoretical performance of a system is deceptive and hard to achieve. A performance bound for a task graph on a system helps one assess the quality of any schedule and the scope for improvement. It also helps one assess relative performance expectations of future systems for different applications. I used linear programming formulations to compute bounds that account for all computations and dependencies of an application as well as all computing resources. These bounds are proposed for a single node composed of CPUs and GPUs and assume that communications are completely overlapped with computations. This work was added to StarPU, a runtime system developed at Inria Bordeaux.
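The flavor of such a bound can be illustrated with a small "area" linear program that ignores dependencies: relax the assignment so that the tasks of each type are split fractionally across resource classes, and minimize the time by which the most loaded class finishes. This toy formulation (inputs and names are illustrative, not the LP from my work) uses SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def area_bound(counts, times, resources):
    """Relaxed LP lower bound on makespan, ignoring dependencies.
    counts[k]    : number of tasks of type k
    times[k][r]  : duration of a type-k task on resource class r
    resources[r] : number of units in resource class r
    Variables: x[k,r] (fractional task counts) and T (makespan)."""
    K, R = len(counts), len(resources)
    nvars = K * R + 1
    c = np.zeros(nvars); c[-1] = 1.0            # minimize T (last variable)
    # equality: every task of each type is assigned somewhere
    A_eq = np.zeros((K, nvars)); b_eq = np.array(counts, float)
    for k in range(K):
        A_eq[k, k * R:(k + 1) * R] = 1.0
    # inequality: work placed on class r fits within resources[r] * T
    A_ub = np.zeros((R, nvars)); b_ub = np.zeros(R)
    for r in range(R):
        for k in range(K):
            A_ub[r, k * R + r] = times[k][r]
        A_ub[r, -1] = -resources[r]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * nvars)
    return res.fun
```

For example, 10 identical tasks taking 2 time units on a CPU and 1 on a GPU, with one unit of each, yield a bound of 20/3: the LP balances the load so both resources finish together.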
Dynamic scheduling strategies with static rules: Most runtime systems make their decisions dynamically based on basic information such as the expected duration of a task on each type of resource and the expected duration of the communication of its input data. I proposed and investigated how adding static rules, derived from an offline analysis of the problem, into dynamic schedulers improves the overall performance of the application. This work was implemented inside StarPU.
A resource-centric scheduling strategy: I proposed and evaluated a new class of scheduling strategy, HeteroPrio, for two types of resources. It is based on the affinity between tasks and resources: resources select the tasks they favor, so GPUs execute tasks with higher acceleration ratios and CPUs those with lower ratios. I introduced several corrections in HeteroPrio to avoid idle time on the fast resources and assessed them on linear algebra kernels. I also studied the theoretical performance guarantees of this strategy for sets of independent tasks and for task graphs, and extended the strategy to handle multiple types of resources.
All the above work has been conducted during my PhD at Inria Bordeaux with Olivier Beaumont, Emmanuel Agullo, Lionel Eyraud-Dubois and Samuel Thibault. The code is available in the public version of StarPU.
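The affinity principle behind HeteroPrio can be illustrated with a toy simulation on independent tasks. This is a sketch of the idea only, not the StarPU implementation, and it omits the idle-time corrections mentioned above:

```python
import heapq
from collections import deque

def heteroprio(tasks, n_cpu, n_gpu):
    """Toy HeteroPrio on independent tasks: sort by acceleration ratio
    (cpu_time / gpu_time); an idle GPU takes the most GPU-friendly
    remaining task, an idle CPU the least. Returns the makespan.
    tasks: list of (cpu_time, gpu_time) pairs."""
    ready = deque(sorted(tasks, key=lambda t: t[0] / t[1]))
    # event queue of (time the unit becomes free, unit kind)
    events = [(0.0, "cpu")] * n_cpu + [(0.0, "gpu")] * n_gpu
    heapq.heapify(events)
    makespan = 0.0
    while ready:
        t, kind = heapq.heappop(events)
        if kind == "gpu":
            cpu_t, gpu_t = ready.pop()       # highest acceleration ratio
            t += gpu_t
        else:
            cpu_t, gpu_t = ready.popleft()   # lowest acceleration ratio
            t += cpu_t
        makespan = max(makespan, t)
        heapq.heappush(events, (t, kind))
    return makespan
```

With one CPU and one GPU, the GPU drains the highly accelerated tasks while the CPU handles the tasks it runs nearly as well as the GPU would.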
Performance optimizations of TTI RTM on GPU-based hybrid architectures: The TTI RTM algorithm is widely used in seismic imaging. Its huge computational cost makes it challenging for large-scale exploration. The most time-consuming kernels of TTI RTM perform stencil computations, which are embarrassingly parallel. I developed GPU code for these kernels and manually applied optimizations such as loop unrolling, divergence reduction, and prefetching. This work was conducted during my tenure at IBM Research with Ankur Narang and Jyothish Soman.
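To illustrate why such kernels parallelize so well: each output point of a stencil sweep depends only on a fixed neighborhood of the input, so all points can be updated independently. The sketch below is a generic second-order 3D stencil in NumPy, not the actual TTI kernels:

```python
import numpy as np

def stencil_sweep(u, c):
    """Illustrative 7-point 3D stencil (a Laplacian-like update of the
    kind at the heart of wave-propagation codes). Every interior point
    is computed from its six neighbors, independently of the others."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1, 1:-1] = c * (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1]
        + u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1]
        + u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]
        - 6.0 * u[1:-1, 1:-1, 1:-1])
    return out
```

On a GPU, each interior point maps naturally to one thread; the manual optimizations mentioned above (unrolling, prefetching into shared memory) reduce redundant loads of the overlapping neighborhoods.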