Major Accomplishments
Best Paper Award for SIGMA at HPCA'20
Best Research Poster at International Supercomputing'19 in AI/ML Track
Intel Labs Highest Impact Research Award for 2018 and 2016 (Gordy'19 & Gordy'17) and nominee for Gordy'20
MPI Forum Adopts MPI Endpoints Proposal
Contributions to India's AI National Strategy (ICTAI/CORE) - ICTAI announced with Intel
Large Batch Training for Deep Learning (ICLR'17)
ILSVRC'15 Top 20
Best Paper Finalist at SC'14
#1 on Green-500 (Nov 2012)
Our lab seeks to be an industry role model for application-driven architecture research for highly parallel, compute-intensive applications. With access to the latest computing technologies and computing at scale, we work in close collaboration with leading academic and industry co-travelers to understand the implications of emerging applications for the future of computing. The application domains include Deep Learning, Artificial Intelligence, Data Mining, Natural Language Processing, Computer Vision, Computational Biology, Computational Sciences and more.
The research focus is on 1) creating breakthrough algorithms in these application domains, 2) developing highly parallel and scalable algorithms, 3) building middleware and compilers that bring these capabilities to the world at large, and 4) redefining the emerging computing architectures of the future to best meet the needs of this class of algorithms.
Our researchers routinely achieve higher-than-average acceptance rates at Tier-1 conferences. At the same time, our lab has successfully translated its research into high impact across the Intel roadmap and future computing architectures via close collaborations inside Intel.
Lab alumni whom I have had the privilege to work with:
Sunil Sherlekar (CEO SankhyaSutra Labs), Anand Deshpande (CEO Asquared IOT), Satya Gautam Vadlamudi (Foundation Work, Startup), Dheevatsa Mudigere (NVIDIA), Srinivas Sridharan (NVIDIA), Nataraj Jammalamadaka (Amazon Lab126), Ninad Kothari (Johns Hopkins), Ganesh Bikshandi, Aniruddha Shet, Karthikeyan Vaidyanathan (Intel), Sangeeta Bhattacharya (Intel), Kiran Pamnany, Kunal Banerjee, Sanket Tavarageri, Ishwar Bhati (Intel), Naveen Mellempudi (AMD)
PUBLICATIONS
2023
“A Novel Fault-Tolerant Architecture for Tiled Matrix Multiplication”: Sudarshan Srinivasan; contributors: Victor da Cruz Ferreira, Prof. Sandip Kundu (UMass Amherst), Sandeep Bal (UMass Amherst), Chandra Sekhar (UMass Amherst); accepted at DATE 2023
“AutoSparse: Towards Automated Sparse Training of Deep Networks”; Abhisek Kundu, Naveen Mellempudi, Dharma Teja Vooturi, Bharat Kaul, Pradeep Dubey; submitted to ICLR’23. https://openreview.net/forum?id=zyfEWkV6it
“DistGNN-MB: Distributed Large-scale Graph Neural Network Training on x86 via Minibatch Sampling,” Vasimuddin Md, Ramanarayan Mohanty, Sanchit Misra, Sasikanth Avancha; submitted to MLSys 2023. https://arxiv.org/abs/2211.06385
“ChemTSv2: Functional Molecular Design Using de novo Molecule Generator”; Shoichi Ishida (Yokohama City University), Tanuj Aasawat (PCL, Intel), Masato Sumita (RIKEN, National Institute for Materials Science), Michio Katouda (RIST), Tatsuya Yoshizawa (Yokohama City University), Kazuki Yoshizoe (Kyushu University), Koji Tsuda (RIKEN, University of Tokyo), Kei Terayama (RIKEN, Tokyo Institute of Technology); accepted at WIREs Computational Molecular Science (Wiley; July 2023; Impact Factor: 11.5), https://wires.onlinelibrary.wiley.com/doi/epdf/10.1002/wcms.1680
“DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing”. Yangtian Zhang (Mila), Zuobai Zhang (Mila), Bozitao Zhong (Mila), Sanchit Misra (PCL, Intel), Jian Tang (Mila); accepted at NeurIPS’23.
“PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design”. Chuanrui Wang (Mila), Bozitao Zhong (Mila), Zuobai Zhang (Mila), Narendra Chaudhary (PCL, Intel), Sanchit Misra (PCL, Intel), Jian Tang (Mila); accepted at NeurIPS’23 workshop AI4D3.
“Intel Xeon is all you need for AI inference: Performance Leadership on Real World Applications” <link>; published blog disclosing Intel achieving the highest reported industry performance on Google’s AlphaFold2 inference and DeepVariant pipelines
“GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis”; Yufeng Gu (UMich), Arun Subramaniyan (UMich), Tim Dunn (UMich), Alireza Khadem (UMich), Kuan-Yu Chen (UMich), Somnath Paul (Intel), Md Vasimuddin (Intel), Sanchit Misra (Intel), David Blaauw (UMich), Satish Narayanasamy (UMich), and Reetuparna Das (UMich); accepted at the International Symposium on Computer Architecture (ISCA 2023).
2022
Intel and AWS jointly published a blog on Open Omics workload performance in the cloud
Intel’s HLS industry brief: Intel Open Omics detailed with performance numbers on key workloads
“Intel Labs Accelerates Single-cell RNA-Seq Analysis” blog, published under Intel Communities/Blogs/Tech Innovation/Artificial Intelligence (AI); June 2022.
Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models: Saeed Rashidi, William Won, Sudarshan Srinivasan, Tushar Krishna; ISCA 2022
Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models, arXiv 2022; William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
“Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data.” HiComb 2022, IPDPS Workshop: Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman and Bharat Kaul
Measuring frequency and period separations in red-giant stars using machine learning: Siddharth Dhanpal, Othman Benomar, Shravan Hanasoge, Abhisek Kundu, Dattaraj Dhuri, Dipankar Das, Bharat Kaul. The Astrophysical Journal: https://arxiv.org/abs/2202.07599
Accelerating minimap2 for long-read sequencing applications on modern CPUs. Saurabh Kalikar, Chirag Jain, Md Vasimuddin, Sanchit Misra. Nature Computational Science 2 (2), 78-83, Feb, 2022. https://rdcu.be/cHVAK
2021
GenomicsBench: A Benchmark Suite for Genomics. Arun Subramaniyan, Yufeng Gu, Timothy Dunn, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021. https://ieeexplore.ieee.org/document/9408208
LISA: Learned indexes for sequence analysis. Darryl Ho, Saurabh Kalikar, Sanchit Misra, Jialin Ding, Vasimuddin Md, Nesime Tatbul, Heng Li, Tim Kraska. bioRxiv 2020.12.22.423964; doi: https://doi.org/10.1101/2020.12.22.423964.
"Accelerating Identification of Chromatin Accessibility from noisy ATAC-seq Data using Modern CPUs." Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, and Bharat Kaul. bioRxiv (2021). https://www.biorxiv.org/content/10.1101/2021.09.28.462099v1.abstract
Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads: Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Narendra Chaudhary, Abhisek Kundu, Denise Kutnick, Frank Laub, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, Hans Pabst, Barukh Ziv, Alexander Heinecke; Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21); https://dl.acm.org/doi/pdf/10.1145/3458817.3476206
“DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks”; Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K. Ahmed, Sasikanth Avancha; https://dl.acm.org/doi/10.1145/3458817.3480856
“Efficient and Generic 1D Dilated Convolution Layer for Deep Learning”; Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, Bharat Kaul; https://arxiv.org/abs/2104.08002
“AI Powered Compiler Techniques for DL Code Optimization”; Sanket Tavarageri, Gagandeep Goyal, Sasikanth Avancha, Bharat Kaul, Ramakrishna Upadrasta; https://arxiv.org/abs/2104.05573
MADRaS: Multi Agent Driving Simulator: Anirban Santara, Sohan Rudra, Sree Aditya Buridi, Meha Kaushik, Abhishek Naik, Bharat Kaul, Balaraman Ravindran; accepted for publication in the Journal of Artificial Intelligence Research, 2021: https://arxiv.org/abs/2010.00993
PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives : Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Bharat Kaul, Gagandeep Goyal, Ramakrishna Upadrasta In ACM Transactions on Architecture and Code Optimization (TACO). Presented at High Performance and Embedded Architecture and Compilation (HiPEAC) conference, January 2021. https://dl.acm.org/doi/abs/10.1145/3433103
ASTRA-sim: Enabling SW/HW Co-Design Exploration for Distributed Deep Learning Training Platforms: Saeed Rashidi, Tushar Krishna (Georgia Tech), Sudarshan Srinivasan (Intel); tutorial at ASPLOS 2021 (https://asplos-conference.org/tutorials/)
“MINT: Microarchitecture for Efficient and Interchangeable Compression Formats on Tensor Algebra”; Eric Qin (Georgia Tech), Geonhwa Jeong (Georgia Tech), Jonghoon Won (Georgia Tech), Sheng-Chun Kao (Georgia Tech), Hyoukjun Kwon (Georgia Tech), Sudarshan Srinivasan (Intel), Dipankar Das (Intel), Gordon E. Moon (Sandia National Laboratories), Sivasankaran Rajamanickam (Sandia National Laboratories), Tushar Krishna (Georgia Tech); accepted at IPDPS’21
A Lightweight Error-Resiliency Mechanism for Deep Neural Networks: Brunno F. Goldstein, Victor C. Ferreira, Sudarshan Srinivasan, Dipankar Das, Alexandre S. Nery, Sandip Kundu and Felipe M. G. França; accepted at ISQED 2021
GNNerator: A Hardware/Software Framework for Accelerating Graph Neural Networks: Jacob R. Stevens (Purdue), Dipankar Das, Sasikanth Avancha, Bharat Kaul, Prof. Anand Raghunathan (Purdue); accepted at DAC’21
SEERL: Sample Efficient Ensemble Reinforcement Learning: Rohan Saphal, Dheevatsa Mudigere, Sasikanth Avancha, Bharat Kaul, Prof. B. Ravindran (IIT-M); accepted at AAMAS’21
Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms: Saeed Rashidi (Georgia Tech), Matthew Denton (Georgia Tech), Sudarshan Srinivasan (Intel), Srinivas Sridharan (Facebook), Amoghavarsha Suresh (Stony Brook University), Jade Nie (Facebook), Tushar Krishna (Georgia Tech); ISCA 2021
2020
The Case for a Learned Sorting Algorithm: Ani Kristo, Kapil Vaidya, Uğur Çetintemel, Sanchit Misra, Tim Kraska at SIGMOD’2020
Harnessing Deep Learning via a Single Building Block: Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke at the International Parallel & Distributed Processing Symposium (IPDPS), May 2020 (accepted)
Reliability Evaluation of Compressed Deep Learning Models: Brunno Goldstein, Sudarshan Srinivasan, Dipankar Das, Kunal Banerjee, Sandip Kundu, Felipe M.G. França; accepted at the 11th IEEE Latin American Symposium on Circuits and Systems (LASCAS 2020)
PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives. Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Gagandeep Goyal, Ramakrishna Upadrasta, Bharat Kaul. Preprint on arXiv. https://arxiv.org/abs/2002.02145
ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms: Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Aug 2020
Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning Training Platforms: Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Matthew Denton and Tushar Krishna. arXiv preprint (https://arxiv.org/pdf/2007.00156.pdf)
Optimizing deep learning recommender systems training on CPU cluster architectures: Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, and Alexander Heinecke. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). IEEE Press, Article 43, 1–15. https://dl.acm.org/doi/10.5555/3433701.3433758
Benchmarking Learned Indexes: Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihai Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska. VLDB'2020. https://dl.acm.org/doi/10.14778/3421424.3421425
Deep Graph Library Optimizations for Intel(R) x86 Architecture: Sasikanth Avancha, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, 2020. https://arxiv.org/abs/2007.06354
2019
X-MANN: A Crossbar based Architecture for Memory Augmented Neural Networks: Ashish Ranjan, Shubham Jain, Jacob R. Stevens, Dipankar Das, Bharat Kaul, Anand Raghunathan At DAC’19
Manna - An Accelerator for Memory-Augmented Neural Networks: Jacob Stevens, Ashish Ranjan, Dipankar Das, Bharat Kaul, Anand Raghunathan At IEEE/ACM International Symposium on Microarchitecture (MICRO-52).
Training Google Neural Machine Translation on an Intel CPU Cluster: Dhiraj Kalamkar, Kunal Banerjee, Sudarshan Srinivasan, Srinivas Sridharan, Evangelos Georganas, Mikhail E. Smorkalov, Cong Xu, Alexander Heinecke at the International Conference on Cluster Computing (CLUSTER), September 2019 (accepted)
Optimizing Deep Learning RNN Topologies on Intel Architecture. Kunal Banerjee, Evangelos Georganas, Dhiraj D. Kalamkar, Barukh Ziv, Eden Segal, Cristina Anderson, Alexander Heinecke. Supercomputing Frontiers and Innovations, vol. 6, no. 3, 2019, pp: 64-85, invited paper.
Optimizing Deep Learning LSTM Topologies on Intel Xeon Architecture. Kunal Banerjee, Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke. ISC High Performance, June 2019, Research Poster. Received "Best Research Poster Award" in "Artificial Intelligence and Machine Learning" track
Mixed Precision Training With 8-bit Floating Point: Naveen Mellempudi, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul; arXiv preprint arXiv:1905.12334
A Study of BFLOAT16 for Deep Learning Training; Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, Pradeep Dubey; Preprint on arXiv. arXiv:1905.12322
High-Performance Deep Learning via a Single Building Block: Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke; Preprint on arXiv. https://arxiv.org/pdf/1906.06440.pdf
SEERL: Sample Efficient Ensemble Reinforcement Learning; Rohan Saphal, Balaraman Ravindran, Dheevatsa Mudigere, Sasikanth Avancha and Bharat Kaul; Scaling-Up Reinforcement Learning (SURL) Workshop, IJCAI 2019. pdf
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems: Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.
Accelerating Sequence Alignment to Graphs: Chirag Jain, Sanchit Misra, Haowen Zhang, Alexander Dilthey, Srinivas Aluru. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019
dMazeRunner: Shail Dave, Youngbin Kim, Sasikanth Avancha, Kyoungwoo Lee, Aviral Shrivastava at ESWEEK 2019 (CODES+ISSS track)
High Performance Scalable FPGA Accelerator for Deep Neural Networks: Sudarshan Srinivasan, Pradeep Janedula, Saurabh Dhoble, Sasikanth Avancha, Dipankar Das, Naveen Mellempudi, Bharat Daga, Martin Langhammer, Gregg Baeckler; preprint on arXiv.
K-TanH: Hardware Efficient Activations For Deep Learning: Abhisek Kundu, Sudarshan Srinivasan, Eric C. Qin, Dhiraj Kalamkar, Naveen K. Mellempudi, Dipankar Das, Kunal Banerjee, Bharat Kaul, Pradeep Dubey. Preprint on arXiv, September 2019, arXiv:1909.07729.
LISA: Towards Learned DNA Sequence Search: Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md, Tim Kraska. Workshop on Systems for ML at NeurIPS 2019 [preprint]
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training: Dipankar Das, Ishwar Bhati, Sasikanth Avancha, Sudarshan Srinivasan, Mahesh Vutukuri and Supratim Pal
2018
A Progressive Batching L-BFGS Method for Machine Learning; Raghu Bollapragada, Dheevatsa Mudigere, Jorge Nocedal, Hao-Jun Michael Shi, Ping Tak Peter Tang; Conference paper, long talk at ICML 2018. arXiv:1802.05374, optimization-online
Mixed Precision Training of Convolutional Neural Networks using Integer Operations: Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov. International Conference on Learning Representations (ICLR), April 2018, pp: 1-11.
On Scale-out Deep Learning Training for Cloud and HPC; Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey; Poster at SysML, 2018. SysML'18, arXiv:1801.08030
Ternary Residual Networks; Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, Pradeep Dubey; Poster at SysML, 2018. SysML'18, arXiv:1707.0467
Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-out Classifiers: Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, and Theodore L. Willke At ECCV’18
Released the first driving dataset from India, the India Driving Dataset, at the AutoNUE workshop at ECCV’18
Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures: Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, Alexander Heinecke at International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), November 2018, pp: 66:1-66:12.
Hierarchical Block Sparse Neural Networks; Dharma Teja Vooturi, Dheevatsa Mudigere, Sasikanth Avancha; Preprint on arXiv 2018. arXiv:1808.03420
Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting: Tony Pan, Sanchit Misra, Srinivas Aluru at Supercomputing, 2018.
Performance Extraction and Suitability Analysis of Multi- and Many-core Architectures for Next Generation Sequencing Secondary Analysis: Sanchit Misra, Tony Pan, Kanak Mahadik, George Powley, Priya N. Vaidya, Md Vasimuddin, Srinivas Aluru At Parallel Architectures and Compilation Techniques (PACT, 2018).
Identification of Significant Computational Building Blocks through Comprehensive Deep Dive of NGS Secondary Analysis Methods: Md Vasimuddin, Sanchit Misra, Srinivas Aluru. bioRxiv 301903.
2017
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima; Nitish S. Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang; Oral presentation and paper at ICLR 2017. OpenReview, poster, arXiv:1609.04836v1, code
Planning for performance: persistent collective operations for MPI: Bradley Morgan, Daniel J Holmes, Anthony Skjellum, Purushotham Bangalore, Srinivas Sridharan At Proceedings of the 24th European MPI Users' Group Meeting
ScaleDeep – A Scalable Compute Architecture for Learning and Evaluating Deep Networks : Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, Anand Raghunathan At ISCA'17
RAIL: Risk-Averse Imitation Learning; Anirban Santara, Abhishek Naik, Balaraman Ravindran, Dipankar Das, Dheevatsa Mudigere, Sasikanth Avancha, Bharat Kaul; Deep Reinforcement Learning Symposium, NIPS 2017. arXiv:1707.06658
Deep learning at 15PF: supervised and semi-supervised classification for scientific data: Thorsten Kurth, Jian Zhang, Nadathur Satish, Evan Racah, Ioannis Mitliagkas, Md Mostofa Ali Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Pradeep Dubey At SuperComputing'17
Ternary Neural Networks with Fine-Grained Quantization; Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, Pradeep Dubey; Preprint on arXiv, 2017. arXiv:1705.01462
Development of a Nodal DG Solver within the SU2 Framework; Edwin van der Weide, Jae hwan Choi, Dheevatsa Mudigere, Paul Urbanczyk, Juan J. Alonso; SU2 Developers Meeting, 2017.
Performance Optimizations for the SU2 Higher-Order DG-FEM Fluid Solver on the Intel Xeon Phi (KNL); Edwin van der Weide, Thomas D. Economon, Juan J. Alonso, Jae hwan Choi, Dheevatsa Mudigere, Alexander Heinecke, Gaurav Bansal; Presented at SIAM-CSE 2017 MS44 Efficiency of High-Order Methods on the 2nd Generation Intel Xeon Phi Processor.
Mixed Low-precision Deep Learning Inference using Dynamic Fixed Point; Naveen Mellempudi, Abhisek Kundu, Dipankar Das, Dheevatsa Mudigere, Bharat Kaul; Preprint on arXiv, 2017. arXiv:1701.08978
Distributed Hessian-Free Optimization for Deep Neural Networks; Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy, Martin Takáč; Distributed Machine Learning Workshop, AAAI 2017. AAAI'17 paper, arXiv:1606.00511
Eliminating irregularities of protein sequence search on multicore architectures: Jing Zhang, Sanchit Misra, Hao Wang, Wuchun Feng At IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
2016
Distributed Deep Learning Using Synchronous Stochastic Gradient Descent; Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidyanathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey; Preprint on arXiv, 2016. arXiv:1602.06709
On Customized Computer Arithmetic for Deep Neural Network; Ping Tak Peter Tang, Naveen Mellempudi, Dheevatsa Mudigere. Intel Arithmetic Symposium, 2016.
Intel® Xeon Phi™ Delivers Competitive Performance For Deep Learning—And Getting Better Fast: blog on IA (Xeon Phi) coverage of Baidu's DeepBench benchmark.
Performance optimizations for scalable implicit RANS calculations with SU2; Thomas D. Economon, Dheevatsa Mudigere, Gaurav Bansal, Alexander Heinecke, Francisco Palacios, Jongsoo Park, Mikhail Smelyanskiy, Juan J. Alonso, Pradeep Dubey; Computers & Fluids journal, February 2016. doi:10.1016/j.compfluid.2016.02.003
Scaling up Hartree–Fock calculations on Tianhe-2: Edmond Chow, Xing Liu, Sanchit Misra, Marat Dukhan, Mikhail Smelyanskiy, Jeff R. Hammond, Yunfei Du, Xiang-Ke Liao and Pradeep Dubey At The International Journal of High Performance Computing Applications, Volume 30, Issue 1, 2016, Pages 85-102.
Comparing runtime systems with exascale ambitions using the parallel research kernels: Rob F Van der Wijngaart, Abdullah Kayi, Jeff R Hammond, Gabriele Jost, Tom St John, Srinivas Sridharan, Timothy G Mattson, John Abercrombie, Jacob Nelson At International Supercomputing'16
muBLASTP: database-indexed protein sequence search on multicore CPUs: Jing Zhang, Sanchit Misra, Hao Wang, Wuchun Feng At BMC Bioinformatics. 2016
2015
ImageNet ILSVRC 2015 Object classification/localization (CLS-LOC) submission; Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Nataraj Jammalamadaka, Karthik Vaidyanathan; ILSVRC 2015. results
Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor: Sanchit Misra, Kiran Pamnany, Srinivas Aluru. At IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 12, Issue 5, Sept.-Oct. 2015.
Dtree: Dynamic task scheduling at petascale: Kiran Pamnany, Sanchit Misra, Md Vasimuddin, Xing Liu, Edmond Chow, Srinivas Aluru. At International SuperComputing Conference, 2015
GraphMat: high performance graph analytics made productive: Narayanan Sundaram, Nadathur Rajagopalan Satish, Md Mostofa Ali Patwary, Subramanya R Dulloor, Satya Gautam Vadlamudi, Dipankar Das, Pradeep Dubey, Proceedings of the VLDB Endowment, July 2015
Computational Challenges and Optimization Techniques for CFD Applications on Modern Parallel Systems; Anand Deshpande, Dheevatsa Mudigere; Invited talk at the International Conference on High Performance Computing (HiPC), 2015.
High-Performance Algebraic Multigrid Solver Optimized for Multi-Core Based Distributed Parallel Systems; Jongsoo Park, Mikhail Smelyanskiy, Ulrike Meier Yang, Dheevatsa Mudigere, Pradeep Dubey At SuperComputing 2015
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading: Karthikeyan Vaidyanathan, Dhiraj D Kalamkar, Kiran Pamnany, Jeff R Hammond, Pavan Balaji, Dipankar Das, Jongsoo Park, Bálint Joó At SuperComputing'2015
High-Performance, Modern Code Optimizations for Computational Fluid Dynamics - featured blog detailing the work on SU2 in collaboration with Aerospace Design Lab at Stanford, 2015.
Accelerating Computational Fluid Dynamics Code on Multi-/Many-Core Intel Platforms; Gaurav Bansal, Anand Deshpande, Paul Edwards, Alexander Heinecke, Michael Klemm, Dheevatsa Mudigere, Elmoustapha Ould-ahmed-vall, Mikhail Smelyanskiy, Michael Steyer, Nishant Agrawal, Ravi Ojha, Ambuj Pandey, Rihab Abdul Razak, Juan J. Alonso, Thomas D. Economon, Francisco Palacios, David Keyes; 27th International Conference on Parallel Computational Fluid Dynamics (ParCFD), 2015. doi:10.1.1.726.4213, pdf.
Exploring Shared-memory Optimizations for an Unstructured Mesh CFD application on Modern Parallel Systems; Dheevatsa Mudigere, Srinivas Sridharan, Anand Deshpande, Jongsoo Park, Alexander Heinecke, Mikhail Smelyanskiy, Bharat Kaul, Pradeep Dubey, Dinesh Kaushik, and David Keyes; IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015. doi:10.1109/IPDPS.2015.114, pdf
Towards High-Performance Optimizations of the Unstructured Open-Source SU2 Suite; Thomas D. Economon, Francisco Palacios, Juan J. Alonso, Gaurav Bansal, Dheevatsa Mudigere, Anand Deshpande, Alexander Heinecke, and Mikhail Smelyanskiy; AIAA SciTech, 2015. Also at SIAM-CSE 2015 MS302 PDE-constrained Optimization using the Open-source Code SU2. doi:10.2514/6.2015-1949, pdf
Using the parallel research kernels to study PGAS models: Rob F Van der Wijngaart, Srinivas Sridharan, Abdullah Kayi, Gabriele Jost, Jeff R Hammond, Timothy G Mattson, Jacob E Nelson At International Conference on Partitioned Global Address Space Programming Models
2014
Parallel Bayesian Network Structure Learning for Genome-scale Gene Networks: Sanchit Misra, Md Vasimuddin, Kiran Pamnany, Sriram P. Chockalingam, Yong Dong, Min Xie, Maneesha R. Aluru, Srinivas Aluru At Supercomputing, 2014. Best paper finalist.
Lattice QCD with domain decomposition on Intel® Xeon Phi™ co-processors : Simon Heybrock, Bálint Joó, Dhiraj D Kalamkar, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Tilo Wettig, Pradeep Dubey At SuperComputing'14
Enabling efficient multithreaded MPI communication through a library-based implementation of MPI endpoints: Srinivas Sridharan, James Dinan, Dhiraj D Kalamkar At SuperComputing'14
Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices: Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D Kalamkar, Xing Liu, Md Mostofa Ali Patwary, Yutong Lu, Pradeep Dubey At SuperComputing'14
Petascale high order dynamic rupture earthquake simulations on heterogeneous supercomputers : Alexander Heinecke, Alexander Breuer, Sebastian Rettenberger, Michael Bader, Alice-Agnes Gabriel, Christian Pelties, Arndt Bode, William Barth, Xiang-Ke Liao, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Pradeep Dubey At SuperComputing'14
Parallel Mutual Information Based Construction of Whole-genome Networks on the Intel® Xeon Phi™ Coprocessor: Sanchit Misra, Kiran Pamnany, Srinivas Aluru At IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.
Improving communication performance and scalability of native applications on Intel Xeon Phi coprocessor clusters: Karthikeyan Vaidyanathan, Kiran Pamnany, Dhiraj D Kalamkar, Alexander Heinecke, Mikhail Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha Shet, Bharat Kaul, Bálint Joó, Pradeep Dubey At IPDPS'14
Delayed difference scheme for large scale scientific simulations; Dheevatsa Mudigere, Sunil Sherlekar, Santosh Ansumali; Physical Review Letters, Volume 113, Issue 21, Nov 2014. doi:10.1103/PhysRevLett.113.218701, pdf.
2013
Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor: Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G Shet, George Chrysos, Pradeep Dubey At IPDPS'13
Lattice QCD on Intel® Xeon Phi™ Coprocessors: Bálint Joó, Dhiraj D Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Kiran Pamnany, Victor W Lee, Pradeep Dubey, William Watson At International Supercomputing'13
Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors: Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, Daehyun Kim At SuperComputing'13
On vectorization for lattice based simulations : Aniruddha G Shet, K Siddharth, Shahajhan H Sorathiya, Anand M Deshpande, Sunil D Sherlekar, Bharat Kaul, Santosh Ansumali At International Journal of Modern Physics C
Data structure and movement for lattice-based simulations, Aniruddha Shet, Shahajhan Sorathiya, Siddharth Krithivasan, Anand Deshpande, Bharat Kaul, Sunil Sherlekar and Santosh Ansumali, Phys. Rev. E 88, 013314 (2013). (http://pre.aps.org/abstract/PRE/v88/i1/e013314)
2012
#1 on Green-500 with Intel Xeon Phi (Nov 2012): http://www.green500.org/lists/green201211/
High performance non-uniform FFT on modern x86-based multi-core systems: Dhiraj D Kalamkar, Joshua D. Trzasko, Srinivas Sridharan, Mikhail Smelyanskiy, Daehyun Kim, Armando Manduca, Yunhong Shu, Matt A Bernstein, Bharat Kaul, Pradeep Dubey At IPDPS'12
Improving the performance of dynamical simulations via multiple right-hand sides: Xing Liu, Edmond Chow, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy At IPDPS'12
Optimization of geometric multigrid for emerging multi- and manycore processors: Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian van Straalen, Mikhail Smelyanskiy, Ann S. Almgren, Pradeep Dubey, John Shalf, Leonid Oliker At SuperComputing'12
Extending the BT NAS parallel benchmark to exascale computing: Rob F Van der Wijngaart, Srinivas Sridharan, Victor W Lee At SuperComputing'12
Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures: Mikhail Smelyanskiy, Jason Sewall, Dhiraj D. Kalamkar, Nadathur Satish, Pradeep Dubey, Nikita Astafiev, Ilya Burylov, Andrey Nikolaev, Sergey Maidanov, Shuo Li, Sunil Kulkarni, Charles H. Finan, Ekaterina Gonina At SuperComputing'12