Li, Cong's Software Development

My Software Development

I am now a principal engineer at Intel Asia-Pacific R&D Ltd., currently leading middleware optimization for large-scale deep learning on GPUs and accelerators in datacenter and cloud environments. My recent interests lie in machine learning and data analytics approaches to runtime system management and optimization in areas such as energy efficiency, reliability, and cluster performance/utilization. My work has been productized in multiple Intel software products, including Intel® Extension for PyTorch, Intel® Data Center Manager, Intel® Memory Resilience Technology, and Intel® Platform Resource Manager. I have co-authored 21 papers and 2 book chapters in those areas, advancing the state of the art in GPU workload optimization (see, e.g., my paper in SYSTOR '24), memory reliability management (see, e.g., my papers in SC '22, ICCD '20 and '21), buffer cache replacement policies (see, e.g., my papers in Middleware '21, SYSTOR '18 and '19), performance management in cloud environments (see, e.g., my papers in IISWC '19 and '20, ISC '18), and system anomaly detection (see, e.g., my papers in ICDM '18, SEMI-THERM 32).

Refereed Publications

Cong Li and Yutao Xu (2024). Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference. In Proceedings of the Seventeenth ACM International Systems and Storage Conference (SYSTOR ’24), pp. 53-67. (pdf)

Cong Li, Yu Zhang, Jialei Wang, Hang Chen, Xian Liu, Tai Huang, Liang Peng, Shen Zhou, Lixin Wang, and Shijian Ge (2022). From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell. In Proceedings of the Thirty-Fourth ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '22), pp. 1093-1106. (pdf) (YouTube video) (Bilibili video)

Xiaoming Du and Cong Li (2021). SHARC: Improving Adaptive Replacement Cache with Shadow Recency Cache Management. In Proceedings of the Twenty-Second ACM/IFIP International Middleware Conference (Middleware '21), pp. 119-131. (pdf) (YouTube video) (Bilibili video)

Xiaoming Du, Cong Li, Shen Zhou, Xian Liu, Xiaohan Xu, Tianjiao Wang, and Shijian Ge (2021). Fault-Aware Prediction-Guided Page Offlining for Uncorrectable Memory Error Prevention. In Proceedings of the Thirty-Ninth IEEE International Conference on Computer Design (ICCD '21), pp. 456-463. (pdf) (YouTube video) (Bilibili video)

Xiaoming Du and Cong Li (2021). Predicting Uncorrectable Memory Errors from the Correctable Error History: No Free Predictors in the Field. In Proceedings of the Seventh Annual International Symposium on Memory Systems (MEMSYS '21), article no. 1, pp. 1-10. (pdf) (YouTube video) (Bilibili video)

Li Yi, Cong Li, and Jianmei Guo (2020). CPI for Runtime Performance Measurement: The Good, the Bad, and the Ugly. In Proceedings of the Sixteenth IEEE International Symposium on Workload Characterization (IISWC '20), pp. 106-113. (pdf) (YouTube video) (Bilibili video)

Xiaoming Du and Cong Li (2020). DPCLS: Improving Partial Cache Line Sparing with Dynamics for Memory Error Prevention. In Proceedings of the Thirty-Eighth IEEE International Conference on Computer Design (ICCD '20), pp. 197-204. (pdf) (YouTube video) (Bilibili video)

Huanxing Shen and Cong Li (2020). Runtime Estimation of Application Memory Latency for Performance Analysis and Optimization. In Proceedings of the Sixth Annual International Symposium on Memory Systems (MEMSYS '20), pp. 1-9. (pdf) (YouTube video) (Bilibili video)

Xiaoming Du, Cong Li, Shen Zhou, Mao Ye, and Jing Li (2020). Predicting Uncorrectable Memory Errors for Proactive Replacement: An Empirical Study on Large-Scale Field Data. In Proceedings of the Sixteenth European Dependable Computing Conference (EDCC '20), pp. 41-46. (pdf) (YouTube video) (Bilibili video)

Huanxing Shen and Cong Li (2019). Detecting Last-Level Cache Contention in Workload Colocation with Meta Learning. In Proceedings of the Fifteenth IEEE International Symposium on Workload Characterization (IISWC '19), pp. 14-23. (pdf)

Xiaoming Du and Cong Li (2019). Combining Error Statistics with Failure Prediction in Memory Page Offlining. In Proceedings of the Fifth Annual International Symposium on Memory Systems (MEMSYS '19), pp. 127-132. (pdf)

Cong Li (2019). CLOCK-Pro+: Improving CLOCK-Pro Cache Replacement with Utility-Driven Adaptation. In Proceedings of the Twelfth ACM International Systems and Storage Conference (SYSTOR '19), pp. 1-7. (pdf)

Xiaoming Du and Cong Li (2019). Combining Global Regression and Local Approximation in Server Power Modeling. SICS Software-Intensive Cyber-Physical Systems, vol. 34, no. 1, pp. 35-43. Springer. (pdf)

Yu-Lin Tsou, Hong-Min Chu, Cong Li, and Shao-Wen Yang (2018). Robust Distributed Anomaly Detection Using Optimal Weighted One-class Random Forests. In Proceedings of the Eighteenth IEEE International Conference on Data Mining (ICDM '18), pp. 1272-1277. (pdf)

Xiaoming Du and Cong Li (2018). Memory Failure Prediction Using Online Learning. In Proceedings of the Fourth Annual International Symposium on Memory Systems (MEMSYS '18), pp. 38-49. (pdf)

Huanxing Shen and Cong Li (2018). Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning. In R. Yokota et al. (Eds.) Proceedings of the Thirty-Third International Conference on High Performance Computing, ISC High Performance 2018, LNCS 10876, pp. 144-162, Springer. (pdf)

Cong Li (2018). DLIRS: Improving Low Inter-Reference Recency Set Cache Replacement Policy with Dynamics. In Proceedings of the Eleventh ACM International Systems and Storage Conference (SYSTOR '18), pp. 59-64. (pdf)

Cong Li, Jia Bao, and Haitao Wang (2017). Optimizing Low Memory Killers for Mobile Devices Using Reinforcement Learning. In Proceedings of the Thirteenth International Wireless Communication and Mobile Computing Conference (IWCMC '17), pp. 2169-2174. (pdf)

Cong Li, Huanxing Shen, and Tai Huang (2016). Learning to Diagnose Stragglers in Distributed Computing. In Proceedings of the Ninth Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS@SC '16), pp. 1-6. (pdf)

Cong Li, Abishai Daniel, and Nishi Ahuja (2016). A New Approach to Predict Fan Failures with Fan Speed Correlation. In Proceedings of the Thirty-Second Annual SEMI-THERM Thermal Measurement, Modeling and Management Symposium (SEMI-THERM 32), pp. 90-94. (pdf)

Cong Li (2016). Cooling Anomaly Detection for Servers and Datacenters with Naïve Ensemble. In Proceedings of the Thirty-Second Annual SEMI-THERM Thermal Measurement, Modeling and Management Symposium (SEMI-THERM 32), pp. 157-162. (pdf)

Book Chapters

Jamel Tayeb, Cong Li, and Chang Seok Bae (2011). Writing Energy-Efficient Software for the Data Center. In Bob Steigerwald, Chris Lucero, Abhishek Agrawal, and Chakravarthy Akella (Eds.), Energy Aware Computing: Powerful Approaches for Green System Design, Intel Press.

Murali Rajappa, David Filani, Enrique Castro-Leon, Andy Hoffman, Robin Steinbrecher, Dror Shenkar, Cong Li, and Tianfei Zhu (2010). Platform Assisted Thermal Management of Data Centers. In Enrique Castro-Leon, Bernard Golden, and Miguel Gomez (Eds.), Creating the Infrastructure for Cloud Computing: An Essential Guide for IT Professionals, Intel Press.

Power Management in Data Centers

Some of my previous work focused on modeling and solving dynamic power management problems in Intel® Data Center Manager, which provides power and temperature monitoring and management for servers, racks, and groups of servers in datacenters.

Other Manageability Software

Intel Platform Administration Technology (code name Christea) is a complex manageability solution for the i-Cafe market, providing features such as system provisioning, disk protection, and asset management. The solution was initially designed around firmware capabilities on the client side and later transitioned to a pure software solution with embedded Linux as the pre-OS environment.

Intel System Recovery Tool (code name Leto) is a backup/recovery solution that was previously embedded in the BIOS and is now integrated into bootable USB keys.

Intel Education Administrator (code name Hat Point) is a complex manageability solution for the education market that evolved from Christea, providing functions such as system provisioning and recovery, software installation, and an education shell.

Testing Manageability Solutions

Christea, Leto, and Hat Point consist of many components: on the client side, a pre-OS environment (with its drivers and applications), OS drivers, and OS client applications; on the server side, a web UI, a database, and OS services. The critical areas in the software include data consistency and integrity, stability of the low-level components (the network stack and ATA driver in the pre-OS environment, and the OS driver), server-side and client-side performance, network compatibility, hard disk compatibility, etc.

To test the critical areas efficiently and effectively, we employed the following methodologies in testing

My Talks on Software Testing

My Understanding on Product Testing (Note: The talk covers only the testing of system behaviors with respect to specifications; it does not cover the verification of products with respect to users' needs. In the triangle problem example, overflow in signed integer calculation was missed. Thanks to the Microsoft HPC test team in Shanghai for the feedback.)

Christea 3.1 Defect Analysis (WW07.4): A Case Study in Analyzing and Mining Defects (pdf)

The Mathematics of Testing: From the Theoretical Model to the Practical Approach
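
The signed-integer overflow mentioned in the note on the triangle problem example can be illustrated with a minimal sketch. It is not taken from the talk itself; it assumes the classic formulation (three positive integer side lengths form a triangle if each pair sums to more than the third side) and 32-bit integers with two's-complement wraparound, as in Java. Python integers do not overflow, so the hypothetical helper wrap_int32 simulates the wraparound; all function names here are illustrative.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Map an integer into the 32-bit signed range, simulating wraparound."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def is_triangle_naive(a: int, b: int, c: int) -> bool:
    """Triangle check whose sums can wrap around for large side lengths."""
    return (
        a > 0 and b > 0 and c > 0
        and wrap_int32(a + b) > c
        and wrap_int32(a + c) > b
        and wrap_int32(b + c) > a
    )


def is_triangle_safe(a: int, b: int, c: int) -> bool:
    """Same check with each sum rewritten as a difference.

    For positive 32-bit inputs, the differences always stay in range,
    so no wraparound can occur.
    """
    return (
        a > 0 and b > 0 and c > 0
        and a > c - b   # equivalent to a + b > c
        and a > b - c   # equivalent to a + c > b
        and b > a - c   # equivalent to b + c > a
    )


if __name__ == "__main__":
    big = INT32_MAX
    # An equilateral triangle with maximal side lengths is valid,
    # but the naive check rejects it because big + big wraps to a negative value.
    print(is_triangle_naive(big, big, big))
    print(is_triangle_safe(big, big, big))
```

A test suite that only exercises small side lengths would pass both versions, which is how a boundary case of this kind can be missed.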