Jongsoo Park

A member of technical staff at OpenAI
A former principal research scientist at Meta AI Systems Co-design
Ph.D. from Department of Electrical Engineering, Stanford University
- Concurrent VLSI Group, Advisor: Professor Bill Dally
B.S. from Department of Electrical Engineering, Seoul National University
Previously, a research scientist at Intel Parallel Computing Lab, an intern at VMware, and a software engineer at Penta Security Systems
Contact: JONGSOO "dot" park AT gmail "dot" com

Open Source Projects

FBGEMM: high performance kernels on CPU and GPU that are not readily available (at least at the time of development) in vendor provided libraries, including low-precision GEMMs (for CPU) and embedding operations (for GPU and CPU)
PyTorch: #44 contributor as of now, though most of my contributions were to Caffe2 :)
SkimCaffe: sparse convolutional neural network
SpMP: SParse Matrix Pre-processing library. Fast sparse triangular solver and matrix reorderings such as BFS and reverse Cuthill-McKee
Sparso: Julia package to automate high-level optimizations for sparse linear algebra like inspector-executor and reordering
SPLATT: sparse tensor factorization
SOI-FFT: segment-of-interest low-communication FFT algorithm

Publications: Google scholar, github

Scaling Llama 3 Training with Efficient Parallelism Strategies, Weiwei Chu et al., accepted to be published at ISCA industry track, 2025
Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences, Joel Coburn et al., accepted to be published at ISCA industry track, 2025
Context Parallelism for Scalable Million-Token Inference, Amy Yang et al., 2024, accepted to be published at MLSys, 2025

The Llama 3 Herd of Models, 2024
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation, Liang Luo et al., accepted to be published at MLSys, 2024
Wukong: Towards a Scaling Law for Large-Scale Recommendation, Buyun Zhang et al., accepted to be published at ICML, 2024

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure, Mark Zhao et al., MLSys, 2023
Shared Microexponents: A Little Shifting Goes a Long Way, Bita Rouhani et al., ISCA industry track, 2023
AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models, Fan Lai et al., OSDI, 2023
MTrainS: Improving DLRM training efficiency using heterogeneous memories, Hiwot Tadese Kassa et al.

Unity: A Unified Graph Representation and Runtime for Distributed DNN Training, Zhihao Jia et al., OSDI, 2022
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models, Dheevatsa Mudigere et al., ISCA industry track, 2022
DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction, Buyun Zhang et al.

Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale, Zhaoxia Summer Deng et al., IEEE Micro, 2021, arxiv
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models, Dheevatsa Mudigere et al., 2021
Efficient Soft-Error Detection for Low-precision Deep Learning Recommendation Models, Sihuan Li et al., 2021
First-Generation Inference Accelerator Deployment at Facebook, Michael Anderson et al., 2021
Alternate Model Growth and Pruning for Efficient Training of Recommendation Systems, Xiaocong Du et al., 2021

Training Deep Learning Recommendation Model with Quantized Collective Communications, Jie (Amy) Yang, Jongsoo Park, Srinivas Sridharan, and Ping Tak Peter Tang, KDD Workshop on Deep Learning Practice for High-Dimensional Sparse Data, 2020
Mixed-Precision Embedding Using a Cache, Jie (Amy) Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, Andrew Tulloch, 2020
Adaptive Dense-to-Sparse Paradigm for Pruning Online Recommendation System with Non-Stationary Data, Mao Ye, Dhruv Choudhary, Jiecao Yu, Ellie Wen, Zeliang Chen, Jiyan Yang, Jongsoo Park, Qiang Liu, and Arun Kejariwal, 2020

Deep Learning Recommendation Model for Personalization and Recommendation Systems, code, Naumov et al., 2019
Zion: Facebook Next-Generation Large-Memory Unified Training Platform, HotChips, 2019, accepted
A Study of BFLOAT16 for Deep Learning Training, Kalamkar et al., 2019
Post-Training 4-bit Quantization of Embedding Tables, Hui Guan, Andrey Malevich, Jiyan Yang, Jongsoo Park, and Hector Yuen, accepted to the Workshop on Systems for ML @ NeurIPS 2019

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications, 2018, talk
FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference, Daya Khudia, Jianyu Huang, Protonu Basu, Summer Deng, Haixin Liu, Jongsoo Park and Mikhail Smelyanskiy, International Workshop on the Intersection of High Performance Computing and Machine Learning
On Periodic Functions as Regularizers for Quantization of Neural Networks, Maxim Naumov, Utku Diril, Jongsoo Park, Benjamin Ray, Jedrzej Jablonski, and Andrew Tulloch, 2018
Spatial-Winograd Pruning Enabling Sparse Winograd Convolution, Jiecao Yu, Jongsoo Park, and Maxim Naumov, 2018
Dynamic Fine-Grained Sparse Memory Accesses, Berkin Akin, Chiachen Chou, Jongsoo Park, Christopher Hughes, Rajat Agarwal, MEMSYS, 2018
Glow: Graph Lowering Compiler Techniques for Neural Networks, Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy
HPC Formulations of Optimization Algorithms for Tensor Completion, Shaden Smith, Jongsoo Park, and George Karypis, Elsevier journal of parallel computing, journal version of our SC16 paper
Gate Scheduling for Quantum Algorithms, Gian Giacomo Guerreschi and Jongsoo Park

Enabling Sparse Winograd Convolution by Native Pruning, with Sheng Li and Ping Tak Peter Tang
Faster CNNs with Direct Sparse Convolutions and Guided Pruning, Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey, International Conference on Learning Representations (ICLR), 2017, accepted for publication, github
Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory, Shaden Smith, Jongsoo Park and George Karypis, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2017, accepted for publication
Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory, Jongsoo Park, invited to present at SIAM Conference on Computational Science and Engineering (CSE'17), slides

Holistic SparseCNN: Forging the Trident of Accuracy, Speed, and Size, Jongsoo Park, Sheng R. Li, Wei Wen, Hai Li, Yiran Chen, and Pradeep Dubey, github, version submitted to ICLR
Sparso: Context-driven Optimizations of Sparse Linear Algebra, Hongbo Rong, Jongsoo Park, Lingxiang Xiang, Todd A. Anderson, and Mikhail Smelyanskiy, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2016, github
Automating Wavefront Parallelization for Sparse Matrix Codes, Anand Venkat, Mahdi Soltan Mohammadi, Jongsoo Park, Hongbo Rong, Rajkishore Barik, Michelle Strout, and Mary Hall, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2016, nominated as a best paper finalist
An Exploration of Optimization Algorithms for High Performance Tensor Completion, Shaden Smith, Jongsoo Park, and George Karypis, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2016, nominated as a best student paper finalist, invited to present at SIAM Conference on Computational Science and Engineering (CSE'17), SPLATT library
Performance optimizations for scalable implicit RANS calculations with SU2, Thomas D. Economon, Dheevatsa Mudigere, Gaurav Bansal, Alexander Heinecke, Francisco Palacios, Jongsoo Park, Mikhail Smelyanskiy, Juan J. Alonso, and Pradeep Dubey, Journal on Computers and Fluids, 2016

High-Performance Algebraic Multigrid Solver Optimized for Multi-Core Based Distributed Parallel Systems, Jongsoo Park, Mikhail Smelyanskiy, Ulrike Meier Yang, Dheevatsa Mudigere, and Pradeep Dubey, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2015. This paper presents optimization methodologies for algebraic multigrid, an important method for exascale solvers due to its optimal O(N) complexity for a certain class of problems (e.g., elliptic PDEs). We used BoomerAMG in the HYPRE library, a widely used algebraic multigrid solver, as an optimization example. Some of our optimizations have been incorporated into the HYPRE main branch since version 2.11.0.
Improving Concurrency and Asynchrony in Multithreaded MPI Applications Using Software Offloading, Karthikeyan Vaidyanathan, Dhiraj D. Kalamkar, Kiran Pamnany, Jeff R. Hammond, Pavan Balaji, Dipankar Das, Jongsoo Park, and Balint Joo, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2015
Optimizations in High-Performance Conjugate Gradient Benchmark for IA-based Multi and Many-core Processors, Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Md. Mostofa Ali Patwary, Vadim Pirogov, Pradeep Dubey, Xing Liu, Carlos Rosales, Cyril Mazauric, and Christopher Daley, International Journal of High Performance Computing
Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms, Md. Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jongsoo Park, Michael J Anderson, Satya Gautam, Dipankar Das, Sergey G Pudov, Vadim O Pirogov, and Pradeep Dubey, International Supercomputing Conference (ISC), 2015
Exploring Shared-memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems, Dheevatsa Mudigere, Srinivas Sridharan, Anand Deshpande, Jongsoo Park, Alexander Heinecke, Mikhail Smelyanskiy, Bharat Kaul, Pradeep Dubey, Dinesh Kaushik, and David Keyes, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and Its Application to Unstructured Matrices, Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Xing Liu, Md. Mostofa Ali Patwary, Yutong Lu, and Pradeep Dubey, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2014, pdf. HPCG is a new benchmark based on a sparse linear system solver that complements HPL, which measures dense matrix operations. This paper describes our implementation, which took top positions on the first HPCG list, press release. For more recent results, please refer to slides and BoF presentation.
Sparsifying Synchronizations for High-Performance Shared-Memory Sparse Triangular Solver, Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey, International Supercomputing Conference (ISC), 2014, pdf, included in Intel MKL Optimized Technology Preview, talk at ASCR HPCG workshop, open sourced at github
Versatile and Scalable Parallel Histogram Construction, Wookeun Jung, Jongsoo Park, and Jaejin Lee, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2014, pdf, open sourced at github
Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, Muhammad Hassan, Shubo Sengupta, Zhaoming Yin, and Pradeep Dubey, SIGMOD, 2014, pdf
Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters, Karthikeyan Vaidyanathan, Kiran Pamnany, Dhiraj D. Kalamkar, Alexander Heinecke, Mikhail Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha Shet G, Bharat Kaul, Bálint Joó, and Pradeep Dubey, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014, pdf
Distributed SociaLite: A Datalog-Based Language for Large-Scale Graph Analysis, Jiwon Seo, Jongsoo Park, Jaeho Shin, and Monica S. Lam, International Conference on Very Large Data Bases (VLDB), 2014, project homepage, github, pdf

Tera-Scale 1D FFT with Low-Communication Algorithm and Intel Xeon Phi Coprocessors, Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, and Daehyun Kim, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2013, pdf
Location-Aware Cache Management for Many-Core Processors with Deep Cache Hierarchy, Jongsoo Park, Richard M. Yoo, Daya S. Khudia, Christopher J. Hughes, and Daehyun Kim, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2013, pdf

A Framework for Low-Communication 1-D FFT, Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim, and Vladimir Petrov, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Best paper, 2012, pdf, included in Intel Math Kernel Library, also published in Journal of Scientific Programming, Vol. 21
Efficient Backprojection-Based Synthetic Aperture Radar Computation with Many-Core Processors, Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, and Thomas Benson, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Best paper finalist, 2012, pdf, also published in Journal of Scientific Programming, Vol. 21
Billion-Particle SIMD-Friendly Two-Point Correlation on Large-Scale HPC Cluster Systems, Jatin Chhugani, Changkyu Kim, Hemant Shukla, Jongsoo Park, Pradeep Dubey, John Shalf, and Horst D. Simon, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Gordon Bell award finalist, 2012, pdf, open source with Lawrence Berkeley National Laboratory, github
CloudRAMSort: Fast and Efficient Large-scale Distributed RAM Sort on Shared-Nothing Cluster, Changkyu Kim, Jongsoo Park, Nadathur Satish, Hongrae Lee, Pradeep Dubey, and Jatin Chhugani, SIGMOD industrial session, 2012, pdf

Memory Optimizations of Embedded Applications for Energy Efficiency, Jongsoo Park, Stanford University Ph.D. Dissertation, 2011
Fine-grain Dynamic Instruction Placement for L0 Scratch-pad Memory, Jongsoo Park, James Balfour, and William J. Dally, International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 2010, pdf, talk
Buffer-space Efficient and Deadlock-free Scheduling of Stream Applications on Multi-core Architectures, Jongsoo Park and William J. Dally, Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010, pdf, talk
Maximizing the Filter Rate of L0 Compiler-Managed Instruction Stores by Pinning, Jongsoo Park, James Balfour, and William J. Dally, Technical Report 126, Concurrent VLSI Architecture Group, Stanford University, 2009, pdf
A Practical Improvement to the Partial Redundancy Elimination in SSA Form, Jongsoo Park and Jaejin Lee, JCSE, 2008, Vol. 2, No. 3, pdf
Hierarchical Instruction Register Organization, David Black-Schaffer, James D. Balfour, William J. Dally, Vishal Parikh and Jongsoo Park, Computer Architecture Letters, 2008, Vol. 7, No. 2
Efficient Embedded Computing, William J. Dally, James D. Balfour, David Black-Schaffer, James Chen, R. Curtis Harting, Vishal Parikh, Jongsoo Park and David Sheffield, IEEE Computer, July 2008
An Energy-Efficient Processor Architecture for Embedded Systems, James D. Balfour, William J. Dally, David Black-Schaffer, Vishal Parikh and Jongsoo Park, Computer Architecture Letters, 2008, Vol. 7, No. 1
Register Pointer Architecture for Efficient Embedded Processors, Jongsoo Park, Sung-Boem Park, James D. Balfour, David Black-Schaffer, Christos Kozyrakis and William J. Dally, Proceedings of the Conference on Design Automation and Test in Europe (DATE), 2007, pdf

Hobbies

Photography
Writing computer games: these are *really* old games that will run only with emulators
- Galaxy Fighter: 1996, download
- Sigmacraft: 1998, download
