Welcome!

Hi, I am Mostofa Patwary, Director of Large Foundational Language Model at the Applied Deep Learning Research team at NVIDIA. My research interests span Natural Language Processing, Large Scale Deep Learning, and High Performance Computing. I lead the large foundation language model training team, which is responsible for data curation, model training, evaluation, and deployment in real-world applications.

Previously, I worked as a senior researcher at the Silicon Valley AI Lab at Baidu Research, the Parallel Computing Lab at Intel Research, and Northwestern University in Illinois.

I received my PhD from the Department of Informatics at the University of Bergen, Norway. As part of the PhD program, I also studied as a research scholar at Purdue University, USA.

My bachelor's and master's degrees are from the Department of Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET), Bangladesh.

Email: mostofa dot patwary at gmail dot com

Recent Blogs:

Recent Selected Contributions:

Adding Instructions during Pretraining: Effective way of Controlling Toxicity in Language Models (EACL 2023)
Context Generation Improves Open Domain Question Answering (Findings of EACL 2023)
Factuality Enhanced Language Models for Open-Ended Text Generation (NeurIPS 2022)
Multi-Stage Prompting for Knowledgeable Dialogue Generation (ArXiv 2022)
Detoxifying Large-Scale Language Models (ArXiv 2022)
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (World's Largest Dense Language Model, 2022)
End-to-End Training of Neural Retrievers for Open-Domain Question Answering (ACL 2021)
Scaling Language Model Training to a Trillion Parameters Using Megatron (Best student paper award, SC 2021)
Large Scale Multi-Actor Generative Dialog Modeling (ACL 2020)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (~300 citations)
Training Question Answering Models From Synthetic Data (EMNLP 2020)
BioMegatron: Larger Biomedical Domain Language Model (EMNLP 2020)
MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models (EMNLP 2020)
Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data (SC 2017)

Publications:

Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi and Bryan Catanzaro, Adding Instructions during Pretraining: Effective way of Controlling Toxicity in Language Models, (EACL 2023), 2023.
Dan Su, Mostofa Patwary, Shrimai Prabhumoye, et al., Context Generation Improves Open Domain Question Answering, (Findings of EACL 2023), 2023.
Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, Bryan Catanzaro, Factuality Enhanced Language Models for Open-Ended Text Generation, NeurIPS 2022.
Peng Xu, Mostofa Patwary, Shrimai Prabhumoye, et al., Evaluating Parameter Efficient Learning for Generation, EMNLP 2022.
Zihan Liu, Mostofa Patwary, Ryan Prenger, Shrimai Prabhumoye, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro, Multi-Stage Prompting for Knowledgeable Dialogue Generation, (Findings of ACL 2022), 2022.
Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro, Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models, 2022.
Shaden Smith*, Mostofa Patwary*, Brandon Norick, et al., Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model, 2022, (* denotes equal contribution).
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia, "Efficient Large-Scale Language Model Training on GPU Clusters", International Conference on High Performance Computing, Networking, Storage and Analysis (Best student paper award, Supercomputing, SC'21), 2021.
Assefaw Gebremedhin, Mostofa Patwary, Fredrik Manne, "Paradigms for Effective Parallelization of Inherently Sequential Graph Algorithms on Multi-Core Architectures", Handbook of Research on Methodologies and Applications of Supercomputing, Pages 74-95, 2021.
Devendra Singh Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L Hamilton, Bryan Catanzaro, "End-to-End Training of Neural Retrievers for Open-Domain Question Answering", Accepted to 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), 2021.
Sashank Santhanam, Wei Ping, Raul Puri, Mohammad Shoeybi, Mostofa Patwary, Bryan Catanzaro, "Local Knowledge Powered Conversational Agents", arXiv preprint arXiv:2010.10150, 2020.
Alex Boyd, Raul Puri, Mohammad Shoeybi, Mostofa Patwary, Bryan Catanzaro, "Large Scale Multi-Actor Generative Dialog Modeling", Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 66-84, 2020.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Animashree Anandkumar, Bryan Catanzaro, "Controllable Story Generation with External Knowledge Using Large-Scale Language Models", Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Pages 2831-2845, 2020.
Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, "Training Question Answering Models From Synthetic Data", Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Pages 5811–5826, 2020.
Joel Hestness, Gregory Diamos, Hee Woo Jun, Sharan Narang, Newsha Ardalani, Mostofa Patwary, Zhou Yanqi, "Predicting deep learning scaling", US Patent Application number 16206910, 2020.
Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, Raghav Mani, "BioMegatron: Larger Biomedical Domain Language Model", Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Pages 4700-4706, 2020.
Adam Rupe, Nalini Kumar, Vladislav Epifanov, Karthik Kashinath, Oleksandr Pavlyk, Frank Schlimbach, Mostofa Patwary, Sergey Maidanov, Victor Lee, Mr Prabhat, James P Crutchfield, "DisCo: Physics-Based Unsupervised Discovery of Coherent Structures in Spatiotemporal Systems", 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), Pages 75-87, 2019.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, "Megatron-LM: Training Multi-Billion Parameter Language Models Using GPU Model Parallelism", arXiv preprint arXiv:1909.08053, 2019.
Mostofa Patwary, Milind Chabbi, Heewoo Jun, Jiaji Huang, Greg Diamos, Kenneth Church, "Language modeling at scale", 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), pp. 590-599, 2019.
Asit K Mishra, Deborah T Marr, Jong Soo Park, Nadathur Rajagopalan Satish, Mikhail Smelyanskiy, Michael Anderson, Mostofa Ali Patwary, Narayanan Sundaram, Sheng Li, "Sorting data and merging sorted data in an instruction set architecture", US Patent, Application number 10198264, 2019.
Jiayi Huang, Mostofa Patwary, Gregory Diamos, "Coloring big graphs with alphagozero", arXiv preprint arXiv:1902.10162, 2019.
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou, "Deep Learning Scaling is Predictable, Empirically", arXiv preprint arXiv:1712.00409, 2017.
Brian Friesen, Mostofa Patwary, Brian Austin, Nadathur Satish, Zachary Slepian, Narayanan Sundaram, Deborah Bard, Daniel J Eisenstein, Jack Deslippe, Pradeep Dubey, "Galactos: computing the anisotropic 3-point correlation function for 2 billion galaxies", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'17), 2017.
Thorsten Kurth, Jian Zhang, Nadathur Satish, Evan Racah, Ioannis Mitliagkas, Mostofa Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, and Pradeep Dubey, "Deep learning at 15PF: supervised and semi-supervised classification for scientific data", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'17), 2017.
Arif Khan, Alex Pothen, Mostofa Patwary, Nadathur Satish, Narayanan Sundaram, Fredrik Manne, Mahantesh Halappanavar, and Pradeep Dubey, "Efficient Approximation Algorithms for Weighted b-Matching", SIAM Journal on Scientific Computing (SISC 2016), 2016.
Arif Khan, Alex Pothen, Mostofa Patwary, Nadathur Satish, Narayanan Sundaram, Mahantesh Halappanavar, and Pradeep Dubey, "Computing b-Matching to scale on distributed memory multiprocessors by approximation", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'16), 2016.
Mostofa Patwary, Nadathur Satish, Narayanan Sundaram, Jialin Liu, Peter Sadowski, Evan Racah, Surendra Byna, Wahid Bhimji, Craig Tull, Prabhat, and Pradeep Dubey, "PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures", IEEE International Parallel and Distributed Processing Symposium (IPDPS'16), 2016.
Michael J Anderson, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Ted Willke, and Pradeep Dubey, "GraphPad: Optimized Graph Primitives for Parallel and Distributed Platforms", IEEE International Parallel and Distributed Processing Symposium (IPDPS'16), 2016.
Mostofa Patwary, Suren Byna, Nadathur Satish, Narayanan Sundaram, Zarija Lukic, Vadim Roytershteyn, Michael J. Anderson, Yushu Yao, Prabhat and Pradeep Dubey, "BD-CATS: Big Data Clustering at Trillion Particle Scale", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'15), pp. 6:1--6:12, 2015.
Narayanan Sundaram, Nadathur Satish, Mostofa Patwary, Subramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey, "GraphMat: high performance graph analytics made productive", VLDB Endowment 2015, pp. 1214-1225, 2015.
Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Ryan Adams, and Prabhat, "Scalable bayesian optimization using deep neural networks", International Conference on Machine Learning (ICML'15), 2015.
Mostofa Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jongsoo Park, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, Sergey G. Pudov, Vadim O. Pirogov, Pradeep Dubey, "Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms", High Performance Computing (ISC'15), pp. 48-57, 2015.
Mostofa Patwary, Nadathur Satish, Narayanan Sundaram, Fredrik Manne, Salman Habib, and Pradeep Dubey, "PARDICLE: Parallel Approximate Density-Based Clustering", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'14), pp. 560-571, 2014.
Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Xing Liu, Mostofa Patwary, Yutong Lu, and Pradeep Dubey, "Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and Its Application to Unstructured Matrices", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'14), pp. 945-955, 2014.
Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Jiwon Seo, Jongsoo Park, Muhammad Hasan, Shubo Sengupta, Zhaoming Yin, and Pradeep Dubey, "Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets", ACM SIGMOD 2014, pp. 979-990, 2014.
Diana Palsetia, Mostofa Patwary, William Hendrix, Ankit Agrawal, and Alok Choudhary, "Clique Guided Community Detection", IEEE International Conference on Big Data (IEEE BigData), 2014.
Assefaw H. Gebremedhin, Duc Nguyen, Alex Pothen, and Mostofa Patwary, "ColPack: Graph Coloring Software for Derivative Computation and Beyond", ACM Transactions on Mathematical Software (TOMS 2013), Volume 30, No. 1, pp. 1:1-1:31, 2013.
Mostofa Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary, "Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques", International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'13), pp. 49:1-49:12, 2013.
William Hendrix, Diana Palsetia, Mostofa Patwary, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary, "A Scalable Algorithm for Single-Linkage Hierarchical Clustering on Distributed-Memory Architectures", IEEE Symposium on Large Scale Data Analysis and Visualization (LDAV'13), pp. 7-13, 2013.
Yusheng Xie, Zhengzhang Chen, Kunpeng Zhang, Mostofa Patwary, Yu Cheng, Haotian Liu, Ankit Agrawal, and Alok Choudhary, "Graphical Modeling of Macro Behavioral Targeting in Social Networks", Proceedings of SIAM International Conference on Data Mining (SDM'13), pp.740-748, 2013.
Bharath Pattabiraman, Mostofa Patwary, Assefaw H. Gebremedhin, Wei-keng Liao, and Alok Choudhary, "Fast Algorithms for the Maximum Clique Problem on Massive Sparse Graphs", 10th Workshop on Algorithms and Models for the Web Graph (WAW 2013), Lecture Notes in Computer Science, pp. 156-169, 2013.
Chen Jin, Mostofa Patwary, Ankit Agrawal, William Hendrix, Wei-keng Liao, and Alok Choudhary, "DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce", Proceedings of SC workshops, The Fourth International Workshop on Data Intensive Computing in the Clouds (DataCloud 2013), 2013.
Chen Jin, Qiang Fu, Huahua Wang, Ankit Agrawal, William Hendrix, Wei-Keng Liao, Mostofa Patwary, Arindam Banerjee, and Alok Choudhary, "Solving Combinatorial Optimization Problems using Relaxed Linear Programming: A High Performance Computing Perspective", Proceedings of KDD workshops, 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine-13), pp. 39-46, 2013. (Best Paper Award)
William Hendrix, Mostofa Patwary, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary, "Parallel Hierarchical Clustering on Shared Memory Platforms", Proceedings of the 19th IEEE International Conference on High Performance Computing (HiPC'12), pp.1-9, 2012.
Yuhong Zhang, Sanchit Misra, Ankit Agrawal, Mostofa Patwary, Wei-keng Liao, and Alok Choudhary, "Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power", BMC Bioinformatics, Volume 13 (Suppl 5):S3, 2012.
Diana Palsetia, Mostofa Patwary, Kunpeng Zhang, Kathy Lee, Christopher Moran, Yves Xie, Daniel Honbo, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary, "User-Interest based Community Extraction in Social Networks", Proceedings of KDD workshops, The 6th Workshop on Social Network Mining and Analysis (SNA-KDD), 2012.
Mostofa Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary, "A New Scalable Parallel DBSCAN Algorithm Using the Disjoint Set Data Structure", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Supercomputing, SC'12), pp.62:1-62:11, 2012.
Mostofa Patwary, Peder Refsnes, and Fredrik Manne, "Multi-core spanning forest algorithms using the disjoint-set data structure", Proceedings of 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS'12), pp. 827-835, 2012.
Mostofa Patwary, Assefaw H. Gebremedhin, and Alex Pothen, "New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures", Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par'11), Springer LNCS 6853, pp. 250–262, 2011.
Johannes Langguth, Mostofa Patwary, and Fredrik Manne, "Parallel Algorithms for Bipartite Matching Problems on Distributed Memory Computers", Parallel Computing, Volume 37, Issue 12, pp. 820-845, December, 2011.
Kathy Lee, Diana Palsetia, Ramanathan Narayanan, Mostofa Patwary, Ankit Agrawal, and Alok Choudhary, "Twitter Trending Topic Classification", Proceedings of ICDM Workshops, The 6th Workshop on Optimization Based Methods for Emerging Data Mining Problems (OEDM), pp 251-258, 2011.
Yuhong Zhang, Mostofa Patwary, Sanchit Misra, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary, "Enhancing Parallelism of Pairwise Statistical Significance Estimation for Local Sequence Alignment", Proceedings of HiPC Workshops, The Workshop on Hybrid Multicore Computing, pp. 1-8, 2011.
Mostofa Patwary, Jean Blair and Fredrik Manne, "Experiments on Union-Find Algorithms for the Disjoint-Set Data Structure", Proceedings of 9th International Symposium on Experimental Algorithms (SEA'10), Springer LNCS 6049, pp. 411–423, 2010.
Mostofa Patwary and Md. Saidur Rahman, "Minimum Face-Spanning Subgraphs of Plane Graphs", AKCE International Journal of Graphs and Combinatorics, Volume 7, No. 2, pp. 133-150, December, 2010.
Mostofa Patwary, Rob H. Bisseling and Fredrik Manne, "Parallel Greedy Graph Matching using an Edge Partitioning Approach", Proceedings of ICFP Workshops, the Fourth ACM SIGPLAN Workshop on High-level Parallel Programming and Applications (HLPP 2010), pp. 45-54, September, 2010.
Mostofa Patwary, Jean Blair and Fredrik Manne, "Efficient Union-Find implementations", Technical Report 393, Department of Computer Science, University of Bergen, Norway, 2010. http://www.uib.no/ii/forskning/reports-in-informatics/reports-in-informatics-2010-2019/
Fredrik Manne and Mostofa Patwary, "A Scalable Parallel Union-Find Algorithm for Distributed Memory Computers", Proceedings of Eighth International Conference on Parallel Processing and Applied Mathematics (PPAM'09), Springer LNCS 6067, vol. 1, pp. 186–195, 2009.
Mostofa Patwary and Md. Saidur Rahman, "Minimum Face-Spanning Subgraphs of Plane Graphs", Proceedings of WALCOM'07, pp. 62-75, 2007.
Abdullah Al Muzahid, Ahmed Khurshid, Mostofa Patwary, Md. Mostofa Akbar and Masud Karim Khan, "Reservation Based Adaptive Uplink Admission Control for WCDMA", Proceedings of First International Conference on Next-Generation Wireless Systems, pp. 119-123, 2006.