Taekyung Heo
Senior HPC Middleware Developer @ NVIDIA
As a computer scientist, I excel in the areas of distributed machine learning systems, performance modeling, and memory systems, with my efforts particularly focused on bridging the gap between theoretical concepts and their practical implementation through hardware-software co-design.
In my current role as a Senior HPC Middleware Developer at NVIDIA, I apply AI software/hardware co-design to scale AI workloads effectively across thousands of GPUs. This role spans a variety of benchmarking activities and the development of software aimed at improving the scalability of machine learning workloads. I am at the forefront of developing two software products from scratch. This work includes comprehensive benchmarking, simulation frameworks, and the extraction of insights from real-world workload traces across diverse teams and business units. My engineering background, built during my Ph.D. through hands-on work on both hardware and software, including system software such as the Linux kernel and simulation tools for architectural studies, positions me to make substantial contributions to NVIDIA's projects.
Employment
Senior HPC Middleware Developer, NVIDIA, Dec. 2023 - present
Research Engineer II, Georgia Institute of Technology, Mar. 2023 - Dec. 2023
Supervisor: Tushar Krishna
Research Advisor IV @ Meta, Magnit, Nov. 2022 - Dec. 2023
Postdoctoral Fellow, Georgia Institute of Technology, Mar. 2022 - Feb. 2023
Supervisor: Tushar Krishna
Visiting Fellow @ Microsoft Research Asia, FA Talent, Feb. 2018 - Aug. 2018
Education
Doctor of Philosophy (Ph.D.), Computer Science, Korea Advanced Institute of Science & Technology (KAIST), Mar. 2016 - Feb. 2022
Dissertation: Redesigning Hardware and Software Stacks for Terabyte-Scale Memory Systems
Advisor: Jaehyuk Huh
Master of Science (M.Sc.), Computer Science, Korea Advanced Institute of Science & Technology (KAIST), Mar. 2014 - Feb. 2016
Thesis: Dynamic Time Slice Management Based on CPU Pooling in Virtualized Systems
Advisor: Jaehyuk Huh
Bachelor of Science (B.Sc.), Computer Engineering, Sungkyunkwan University, Mar. 2010 - Feb. 2014
Senior Thesis: Performance Analysis of the ext4 File System in Virtualization Environment
Advisor: Young Ik Eom
Publications
Taekyung Heo, Seunghyo Kang, Sanghyeon Lee, Soojin Hwang, Joongun Park, Jaehyuk Huh, "Supporting Trusted Virtual Machines with Hardware-based Secure Remote Memory", International Symposium on Memory Management (ISMM), June 2024
William Won*, Taekyung Heo*, Saeed Rashidi*, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, "ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale", International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023 [slides, video]
Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, and Alexandros Daglis, "COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training", arXiv, November 2022
Daehyeon Baek, Soojin Hwang, Taekyung Heo, Daehoon Kim, and Jaehyuk Huh, "InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-aware Inner Product Processing", International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2021 [slides]
Taekyung Heo, Yang Wang, Wei Cui, Jaehyuk Huh, and Lintao Zhang, "Adaptive Page Migration Policy with Huge Pages in Tiered Memory Systems", IEEE Transactions on Computers [code]
Jeongseob Ahn, Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh, "Accelerating Critical OS Services in Virtualized Systems with Flexible Micro-sliced Cores", European Conference on Computer Systems (EuroSys), April 2018 [slides, video, code]
Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh, "Hybrid TLB Coalescing: Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations", International Symposium on Computer Architecture (ISCA), June 2017 [slides]
Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh, "Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching", International Symposium on Computer Architecture (ISCA), June 2016 [slides]
Workshops
Taekyung Heo, Srinivas Sridharan, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, and Tushar Krishna, "Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces", Workshop on Modeling & Simulation of Systems and Applications (ModSim), August 2023 [slides, poster]
Taekyung Heo, Saeed Rashidi, Changhai Man, Divya Kiran Kadiyala, William Won, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Alexandros Daglis, and Tushar Krishna, "Exploring Memory Expansion Designs for Training Mixture-of-Experts Models", Workshop on Hot Topics in System Infrastructure (HotInfra), June 2023 [slides]
Srinivas Sridharan*, Taekyung Heo*, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, and Tushar Krishna, "Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces", Workshop on Benchmarking Machine Learning Workloads on Emerging Hardware (MLBench), June 2023 [slides]
Jianming Tong, Yangyu Chen, Yue Pan, Abhimanyu Bambhaniya, Alind Khare, Taekyung Heo, Alexey Tumanov, and Tushar Krishna, "FastSwitch: Enabling Real-time DNN Switching via Weight-Sharing", Workshop on Architecture, Compiler, and System Support for Multi-Model DNN Workloads (ACSMD), June 2022
Presentations & Talks
"Chakra and ASTRA-sim: An Open-source Ecosystem for Advancing Co-design for Future AI Systems", AI & Systems Co-design Faculty Summit on behalf of Tushar Krishna, October 2023
"Chakra and ASTRA-sim: An Open-source Ecosystem for Advancing Co-design for Future AI Systems", ACE Monthly Meeting, September 2023
"Designing Multi-Tensor Core Systems in SST", SST User Group Meeting, September 2023
"Execution Trace (ET) Execution Through Simulator", Chakra Working Group Meeting, August 2023
"ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale", SRC Combined CADT/AIHW Annual Review, May 2023
"ASTRA-sim Tutorial", International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2023
"ASTRA-sim Tutorial", Conference on Machine Learning and Systems (MLSys), August 2022
"ASTRA-sim Tutorial", International Symposium on Computer Architecture (ISCA), June 2022
"ASTRA-sim Tutorial", International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), February 2022
"Hybrid TLB Coalescing: Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations", Korea Software Congress, December 2017
Research Experiences
Chakra: Advancing Benchmarking and Codesign with Standardized Execution Traces (GitHub)
Aims to improve pre-silicon codesign and benchmarking for distributed ML through a standardized trace format and reference tools
Acted as one of the main developers of the project, focusing on standardization and tool development
Member of the Chakra working group in MLCommons
Presented at MLBench 2023 and ModSim 2023
Honored with the Dr. Sudhakar Yalamanchili Award at ModSim 2023 (link)
COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
Proposed COMET, a holistic methodology for optimizing cluster design and parallelization strategies in distributed training, enabling rapid design space exploration and evaluation of key cluster resource parameters
Contributed as the third author to implement a roofline-based computation model in ASTRA-sim
Work-in-progress
Stealth Research Project
Implemented computation engines in the Structural Simulation Toolkit (SST)
Validated the implementation of the computation engines against analytical models
Found a bug in the SST memHierarchy and contributed to the project by submitting a PR
Work-in-progress
Exploring Memory Expansion Designs for Training Mixture-of-Experts Models
Investigated various memory expansion design options to overcome the GPU memory wall challenge, specifically in the context of training Mixture-of-Experts (MoE) models
Highlighted that remote memory access time and communication time become major performance bottlenecks in MoE model training, but also found that aggressive offloading significantly reduces local HBM requirements
Presented at HotInfra 2023
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
ASTRA-sim2.0 expands the capabilities of its predecessor, ASTRA-sim1.0, through the addition of support for (1) arbitrary parallelism, (2) hierarchical network modeling, and (3) memory models that were previously unavailable
Contributed as a co-first author by enabling arbitrary parallelism with a graph-based frontend and adding memory system models
Published in ISPASS 2023
A Sparse Matrix Multiplication Accelerator with Locality-aware Inner Product Processing
Identified the memory bloating problem in prior outer-product-based accelerators
Proposed an inner-product-based sparse-matrix multiplication accelerator that exploits the locality in an inner product
Contributed as the third author to assist with motivational experiments and writing
Published in PACT 2021
Hardware-assisted Trusted Memory Disaggregation for Secure Far Memory
Designed and implemented a secure disaggregated memory system that supports fine-grained memory allocation on a Xilinx FPGA board
Implemented the full HW & SW stacks, which include the FPGA and Linux kernel driver
Led the project as the first author
Adaptive Page Migration Policy with Huge Pages in Tiered Memory Systems
Analyzed the memory access patterns of workloads and proposed an adaptive page migration policy using the accessed bits in page table entries
Led the project as the first author
Published in IEEE Transactions on Computers
Accelerating Critical OS Services in Virtualized Systems with Flexible Micro-sliced Cores
In a virtualized environment, virtual CPUs (vCPUs) suffer from synchronization problems when a vCPU holding a lock sleeps
Solved the synchronization problem in virtualized systems by introducing a CPU pool with a shorter time slice
Contributed as the third author to discuss the idea, implement the controller, and conduct motivational experiments
Published in EuroSys 2018
Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations
Proposed a mechanism to encode page contiguity in page table entries to increase the TLB coverage
Allowed an OS to determine the number of contiguity-encoded page table entries to adapt to various contiguity preferences
Contributed as the second author to discuss the idea, implement the simulator, and conduct motivational experiments
Published in ISCA 2017
Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching
Solved the TLB scaling problem by proposing a virtual caching architecture backed by delayed segment address translation
Contributed as the second author to discuss the idea, implement the simulator, and conduct motivational experiments
Published in ISCA 2016
Patents
"Apparatus and Method for Accelerating Critical Service in Virtualized System", KR 1021157380000, Granted: May 21st, 2020
"Method and System to Improve TLB Coverage by Using Chunks of Contiguous Memory", KR 1019426630000, Granted: Jan. 21st, 2019
Professional Services
Program Committee (Light PC, HPCA 2024)
Artifact Evaluation Committee (MLSys 2022)
Awards & Scholarships
Dr. Sudhakar Yalamanchili Award, ModSim, Aug. 2023
Stars of Tomorrow (Award of Excellence), Microsoft Research Asia, Aug. 2018
Excellent Teaching Assistant Award, KAIST, Mar. 2018
KFAS Scholarship, Korea Foundation for Advanced Studies, 2017-2019
National Scholarship, KAIST, 2014 - 2021
Dean's List, College of Information & Communication Engineering, Apr. 2012, Oct. 2012, Apr. 2013, Oct. 2013
National Scholarship for Science and Engineering, Korea Student Aid Foundation (KOSAF), 2010-2013
Skills
Programming Languages: C, C++, Python, CUDA
System Software: Linux Kernel Development, Linux System Administration
Architecture Simulators: Structural Simulation Toolkit (SST), gem5, NVMain, Pin, MARSSx86
FPGA: Vivado, Vitis HLS, Verilog, Tcl
Machine Learning Frameworks: PyTorch, TensorFlow
Extracurricular Activities
Student Representative, Department of Computer Science, KAIST, Feb. 2016 - Dec. 2016
Student, Korea Information Technology Research Institute, Jul. 2013 - Feb. 2014
President, Computer Security Research Club, Sungkyunkwan University, Jul. 2011 - Feb. 2012
Vice President, Computer Security Research Club, Sungkyunkwan University, Mar. 2011 - Jun. 2011
Teaching Experiences
Teaching Assistant for Computer Organization, KAIST, Fall 2017
Teaching Assistant for System Programming, KAIST, Spring 2017
Teaching Assistant for Introduction to Computer Application, KAIST, Fall 2015
Teaching Assistant for Digital System and Lab, KAIST, Spring 2015
Teaching Assistant for System Programming, KAIST, Fall 2014
Teaching Assistant for Introduction to Programming (Python), KAIST, Spring 2014
CV updated on June 24, 2024