Fun of Predicting the Future

Jian Li, PhD aka. Poor hungry Determined or Permanent head Damage or Pile higher Deeper.

Work: NB aka. NoBody of Research and Technology Planning, Futurewei Technologies

Hobby: TRIF (technologist researcher investor forecaster) as an adapter of immune response

E M A I L : jian.li AT futurewei DOT com ; lijianathome AT yahoo DOT com

DISCLAIMER

Views, thoughts, and opinions expressed in the presentation belong solely to the author or public domain references as cited, and not necessarily to the author’s current or former employers, organizations, committees or other related groups or individuals.

Trademarks, logos, etc. belong to their corresponding parties.

Both technology planning and financial investment take risks. One is responsible to its/his/her/theirs/etc.'s own.

INTRODUCTION

Dr. Jian Li is a Sr. Director of Research and Technology Planning at Futurewei Technologies Inc. Prior to this position, he was a research scientist at IBM Research. While with IBM, he also spent over three years on an international assignment as an executive architect at IBM Growth Markets Unit, and was previously a chief architect of Big Data Systems at IBM Greater China Group. He holds a Ph.D. degree from Cornell University, a B.S. degree from Tsinghua University. He is a senior member of both IEEE and ACM.

His technical team has made industrial impacts in big data systems, AI systems, autonomous driving, and other emerging technologies. They participated and won 5 out of total 7 prizes in the Low Power ImageNet Recognition Challenge. More details at http://lpirc.net or IEEE Reboot Computing http://rebootingcomputing.ieee.org/

His prior research had centered on Big Data & Analytics platforms and solutions, such as Apache Hadoop, SPARK, Twitter STORM, IBM BigInsights and InfoSphere Streams, etc on x86, PowerPC and other platforms. A short IBM Research Technical Report summarizes team effort is available here: "Understanding System and Architecture for Big Data", IBM Research Report, RC25281.

He has worked in the areas of architectural support for power- and variation-aware computing, interconnection network design for high-performance computing systems, workload-driven three-dimensional (3D) integration architecture, architectural applications of non-volatile memory (NVM) and storage class memory (SCM), energy-efficient interconnection networks, data center networks, workload optimized systems.

He holds an adjunct position at the Texas A&M University and collaborates with Professor Lawrence Rauchwerger and other faculty. For details on his university collaborations, please check out the "University Collaborations" section of this page.

This web page is intended to record his professional effort in the public domain only. If interested in MU LLC's consulting and/or investment services, please contact lijianathome@yahoo.com . Thanks!

PUBLICATIONS

- More but stopped tracking
- More in 2019
- More in 2018
- 5 papers in 2017
- “Big Data Optimization”, to appear as a book chapter, in Springer series of “Studies in Big Data”.
- “Understanding Architectural Characteristics of Multimedia Data Retrieval and Analytics”, to appear in IBM Journal of Research and Development.
- “China Big Data White Paper (2013)”, co-author with other expert members of China Computer Federation Task Force on Big Data (CCF TFBD), January 2014.
- High Throughput Computer Data Center Architecture, White Paper, Futurewei Technologies (First appeared at ISCA 2014 conference).
- Breaking the Boundary for Whole-System Performance Optimization of Big Data. Yan Li, Kun Wang, Qi Guo, Xin Li, Xiaochen Zhang, Guancheng Chen, Tao Liu, Jian Li. In proceedings of International Symposium on Low Power Electronics and Design (ISLPED), Beijing, China, Sept 4-6, 2013.
- Understanding system design for Big Data workloads. Hofstee, H.P. ; Chen, G.C. ; Gebara, F.H. ; Hall, K. ; Herring, J. ; Jamsek, D. ; Li, J. ; Li, Y. ; Shi, J.W. ; Wong, P.W.Y. IBM Journal of Research and Development. Issue 3/4. May-July 2013
- Scalable community detection in massive social networks using MapReduce. Shi, J. ; Xue, W. ; Wang, W. ; Zhang, Y. ; Yang, B. ; Li, J. IBM Journal of Research and Development. Issue 3/4. May-July 2013
- DuSCA: A Multi-Channeling Strategy for Doubling Communication Capacity in Wireless NoC. Yi Wang, Danella Zhao and Jian Li, The 30th IEEE International Conference on Computer Design (ICCD), September 30 - October 3, 2012. Montreal, Quebec, Canada.
- Cong Liu, Jian Li, Wei Huang, Juan Rubio, Evan Speight, Xiaozhu Felix Lin, "Power-Efficient Time-Sensitive Mapping in CPU/GPUHeterogeneous Systems", to appear in proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT 2012), September 19-23, Minneapolis, MN.

Anne E. Gattiker, Fadi H. Gebara, Ahmed Gheith, Guan Cheng Chen, H. Peter Hofstee, Damir A. Jamsek, Jian Li, Ju Wei Shi, Evan Speight, Peter W. Wong. "Understanding System and Architecture for Big Data" IBM Research Report, RC25281 (AUS1204-004), April 23, 2012.
- Zhenman Fang, Weihua Zhang, Haibo Chen (Fudan University), Jian Li, et al, "Transformer: An Extensible, Fast and Cycle-Accurate Full-system Multi-core Simulator ", In proceedings of The Design Automation Conference (DAC), June 3-7, 2012, San Francisco, CA.
- Chongmin Li, Dongsheng Wang, Haixia Wang, Yibo Xue (Tsinghua University), Jian Li, "Proximity-Aware Cache Replication", In proceedings of 17th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 30- Feb. 2, 2012, Sydney, Australia.
- Elnozahy, E.N., Speight, E., Li, J., Rajamony, R., Zhang, L., Arimilli, L.B. "PERCS System Architecture", Encyclopedia of Parallel Computing, Springer Verlag, pp. 1506-1515, 2011.
- Chongmin Li, Haixia Wang, Yibo Xue, Dongsheng Wang (Tsinghua University), Jian Li, "Scalable Proximity-Aware Cache Replication in Chip Multiprocessors (short paper)", To appear in the Twentieth International Conference on Parallel Architectures and Compilation Techniques (PACT), Galveston Island, TX, October 10-14, 2011.
- TAPO: Thermal-Aware Power Optimization Techniques for Servers and Data Centers. Wei Huang, Malcolm Allen-Ware, John Carter, Elmootazbellah Elnozahy, Hendrik Hamann, Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani and Juan Rubio, In the proceedings of International Green Computing Conference (IGCC), 2011 (Best Paper Award).
Dan Zhao, Yi Wang, Jian Li and Takamaro Kikkawa, "Multi-Channel Wireless Network-on-Chip: A New Approach to Improving On-Chip Communication Capacity", International Symposium on Networks-on-Chip (NOCS), May 1-4, 2011.
- Jian Li, Wei Huang, Lixin Zhang, Charles Lefurgy, Wolfgang Denzel, Richard Treumann and Kun Wang, "Power Shifting in Thrifty Interconnection Networks", To appear in International Symposiun on High Performance Computer Architecture (HPCA), February 12-16, 2011.
- Hsiang-Yun Cheng, Jian Li, Chung-Hsiang Lin, Chia-Lin Yang and Ram Rajamony, "Memory Latency Reduction via Interference-Aware Task Throttling", To appear in the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010
- Baba Arimilli, Ravi Arimilli, Vicente Chung, Scott Clark, Wolfgang Denzel, Ben Drerup, Torsten Hoefler, Jody Joyner, Jerry Lewis, Jian Li, Nan Ni, Ram Rajamony, "The PERCS High-Performance Interconnect. " HotInterconnect (Hoti) '2010: Proceedings of the 18th Annual Symposium on High-Performance Interconnects.
- X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony and Y. Xie, “Design Exploration of Hybrid cache architecture with disparate memory technologies”, To appear in ACM Transaction on Architecture and Code Optimization (TACO), 2010.
- X. Wu, G. Sun, X. Dong, R. Das, J. Li, Y. Xie and C. Das, “Cost-driven 3D Integration with Interconnect Layers”, To appear in 47th Design Automation Conference (DAC), Anahelm, CA, June 13-18, 2010.
- H.-Y. Cheng, J. Li, C.-L. Yang, “An Analytical Model to Exploit Memory Task Scheduling”, To appear in INTERACT-14: Workshop on Interaction between Compilers and Computer Architecture and ACM DL, held in conjunction with ASPLOS 2010, March 13-17, Sheraton Station Square, Pittsburgh, PA, USA.
- W. E. Denzel, J. Li, P. Walker and Y. Jin, “A framework for end-to-end simulation of high-performance computing systems”, In Simulation: Transactions of the Society for Modeling and Simulation International, 2009. (Selected as one of the best conference papers at SimuTools 2008)
- J. Li, L. Zhang, C. Lefurgy, R. Treumann and W. Denzel. "Thrifty Interconnection Networks for HPC systems", poster abstract in the proceedings of International Conference on Supercomputing (ICS) 2009, Yorktown Heights, NY, June 8-12, 2009.
- X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony and Yuan Xie. "Hybrid cache architecture with disparate memory technologies", in Intl. Symp. on Computer Architecture (ISCA) 2009, Austin, TX, June 20-24, 2009. (Opening Session)
- X. Wu, J. Li, E. Speight, L. Zhang, and Yuan Xie. "Power-Performance evaluation of read-write aware hybrid caches", in Design, Automation & Test in Europe (DATE) 2009, Nice, France, April 20-24, 2009.
- G. Sun, X. Dong, Y. Xie, J. Li and Y. Chen, “A Novel Architecture of the 3D Stacked MRAM L2 Cache for CMPs”, In International Symposium on High-Performance Computer Architecture 2009 (HPCA) Raleigh, North Carolina- February 14-18, 2009.
- X. Wu, J. Li, L. Zhang, E. Speight, and Yuan Xie. "Power and performance evaluation of 3D hybrid cache with non-volatile memory", HPCA 2009 3D Workshop, Raleigh, North Carolina, Feb. 14, 2009.
- J. Li and W. E. Denzel, “A framework for end-to-end simulation of high-performance computing systems and its application to PERCS design”, In IBM Academy of Technology - Performance Engineering 'Best Practice' Topical Conference, Southbury, CT, June 2008.
- W. E. Denzel, J. Li, P. Walker and Y. Jin, “A framework for end-to-end simulation of high-performance computing systems”, In Intl. Conf. on Simulation Tools and Techniques for Communications, Networks and Systems (SimuTools), Marseille, France, March 2008. (Opening Session)
- J. Li and J. F. Martínez, “Dynamic power-performance adaptation of parallel computation on chip multiprocessors”, In International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, February 2006.
- J. Li and J. F. Martínez, “Power-performance considerations of parallel computing on chip multiprocessors”, ACM Transaction on Architecture and Code Optimization (TACO), December 2005.
- J. Li and J.F. Martínez, “Power-performance implications of thread-level parallelism on chip multiprocessors”, In International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, March 2005. Earlier version appears in First Watson Conference on the Interaction between Architecture, Circuits, and Compilers (P = ac2), Yorktown Heights, NY, October 2004.
- J. Li, J.F. Martínez, and M.C. Huang, “The thrifty barrier: Energy-efficient synchronization in shared-memory multiprocessors”, In International Symposium on High Performance Computer Architecture (HPCA), Madrid, Spain, February 2004. (Opening Session)
- S. Lloyd, J. Li, J. Peckham, and Q. Yang, “Simultaneous database backup using TCP/IP and a specialized network interface card”, Book chapter in Advanced Topics in Database Research, Vol. 4, 2004.
- S. Lloyd, J. Peckham, J. Li, and Q. Yang, “RORIB – An economic and efficient solution for real-time online remote information backup”, Journal of Database Management (JDBM), 14 (3): 56-73, Jul. – September 2003

PROFESSIONAL ACTIVITIES

- More but stopped tracking
- More in 2019
- More in 2018
- More in 2017
- External Review Committee, MICRO 2016
- IoT Designer Track Chair, DAC 2016
- Industry Track Chair, HPCA 2016
- External Review Committee, HPCA 2016
- Keynote speech, Designer Track, ICCAD, November 2015
- Keynote speech, Hardware and Algorithms for Learning On-a-Chip (HALO) workshop, November 2015
- Keynote speech, West Lake Health Summit, October 2015
- Industry & Government Program Co-chair. IEEE Big Data Conference 2015
- External Review Committee, MICRO 2015
- Program committee, ICDCS 2015
- Program committee, NPC 2015
- External Review Committee, ISCA 2015
- External Review Committee, HPCA 2015
- The 11th IFIP International Conference on Network and Parallel Computing (NPC), Ilan, Taiwan, Sept 18-20, 2014.
- General Co-chair. IEEE ISPA Conference 2014.
- Keynote speech. “Technology Overview and Trend”, NYU US-China Relations Workshop. Dec. 2014.
- Keynote speech. “High Throughput Computing Data Center Architecture,” Open Server Summit, Santa Clara, Nov. 11-13, 2014.
- Keynote speech. “High Throughput Computing Data Center Architecture,” IEEE ISPA Conference, Milan, Italy, August 2014.
- Steering Committee member, Invited talk, "High Throughput Computing Data Center Architecture", Fourth Workshop on Architectures and Systems for Big Data (ASBD), held in conjunction with The 41th International Symposium on Computer Architecture (ISCA-2014), Minneapolis, MN, June 14-18th, 2014.
- Steering Committee. The IEEE International Conference on Networking, Architecture and Storage (NAS), TianJin, China, 2014.
- General Co-Chair. The 12th IEEE International Symposium on Parallel and Distributed Processing with Applications, Milan, Italy, 2014.
- Program Committee Co-Chair. The IEEE International Conference on Networking, Architecture and Storage (NAS), Xi'an, China, July 17-19, 2013.
- Co-organizer. Third Big Data Benchmarking Workshop, Xi'an, China, July16-17, 2013.
- Co-organizer. Third Workshop on Architectures and Systems for Big Data (ASBD), held in conjunction with The 40th International Symposium on Computer Architecture (ISCA-2013), Tel-Aviv Isareal, June 23-27th, 2013.
- Keynote talk. Second International Workshop on Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments (CHANGE), held in conjunction with The 49th Design Automation Conference (DAC-2012), San Francisco, CA, June 3, 2012.
- Co-organizer. Second Workshop on Architectures and Systems for Big Data (ASBD), held in conjunction with The 39th International Symposium on Computer Architecture (ISCA-2012), Portland, Oregon, June 9-13th, 2012.
- Co-organizer, First Workshop on Architecture and Application Exploration of Micro-Server Systems, Held in conjunction with the 18th International Symposium on High Performance Computer Architecture(HPCA-2012)), New Orleans, Louisiana, February 25-29, 2012.
- Invited talk, Fudan University, Shanghai, 07/28/2011.
- PC member. IEEE International Parallel & Distributed Processing Symposium (IPDPS) (IPDPS), Shanghai, China, May 21-25, 2012.
- Invited talk, Tsinghua University, Beijing, China, 06/10/2011.
- Invited talk, Institute of Computing Technologies, Beijing, China, 06/14/2011.
- PC member. International Conference on Systems and Networks Communications (ICSNC), Barcelona, Spain, October 23-28, 2011.
- Co-organizer. First Workshop on Architectures and Systems for Big Data (ASBD), held in conjunction with the The Twentieth International Conference on Parallel Architectures and Compilation Techniques (PACT-2011), Galveston, TX, October 11, 2011.
- PC member. IEEE International Conference of Soft Computing and Pattern Recognition (SoCPaR), Dalian, China, October 14-16, 2011.
- Finance Chair and PC member. IEEE International Symposium on Workload Characterization (IISWC), Austin, TX, October 23-25, 2011.
- Program Committee Co-Chair of Architecture Track. The IEEE International Conference on Networking, Architecture and Storage (NAS), Dalian, China, July 28-30, 2011.
- Invited talk, ECE, Texas A&M University, 04/15/2011.
- Invited talk, Institute of Computing Technologies, Chinese Academy of Sciences, 12/20/2010.
- Invited talk, IBM China Research Lab, 12/17/2010.
- Invited talk, Harbin Institute of Technology, 12/14/2010.
- Invited talk, The Center for Advanced Computer Studies (CACS), University of Louisiana at Lafayette, 11/19/2010.
- Invited talk, CSE, Texas A&M University, 10/20/2010.
- Industry Liason Co-chair. The 17th International Symposium on High Performance Computer Architecture (HPCA). San Antonio, Texas, February 12-16, 2011.
- Session Chair. The 2010 International Symposium on Low Power Electronics and Design (ISLPED). Austin, Texas, August 18-20, 2010.
- Co-organizer. The 2nd Workshop on Emerging Memory Technologies (WEMT). Held in conjunction with International Symposium on Computer Architecture 2010 (ISCA), Saint-Malo, France, June 19-23, 2010.
- Program committee member. The 5th IEEE International Conference on Networking, Architecture and Storage (NAS), the University of Macau, Macau SAR, China, July 15-17, 2010.
- Invited talk, ECE, Carnegie Mellon University, 01/26/2010.
- Invited talk, CSE, Penn State University, 01/25/2010.
- Program committee member. The IEEE Computer Society Annual Symposium On VLSI (ISVLSI), Lixouri Kafalonia, Greece, July 5-6, 2010.
- Host for Professor Hyesoon Kim, CoC, GaTech, 10/06/2009.
- Invited talk, CSE, Texas A&M University, 10/05/2009.

Co-organizer and session chair, Workshop on Emerging Memory Technologies (WEMT). Held in conjunction with International Symposium on Computer Architecture 2009 (ISCA), Austin, TX, June 20-24.
- Host of IBM ARL Technical Seminar for Professor Huiyang Zhou, ECE, NCSU, 06/23/2009.
- Host of IBM Research Arch PIC Seminar for Professor Onur Mutlu, ECE, CMU, 06/19/2009.
- Invited talk, Division of Engineering, Brown University, 06/11/2009.
- Invited talk, ECE, Texas A&M University, 03/10/2009.
- Host of IBM Research Arch PIC Seminar for Professor Mattan Erez, ECE, UT Austin, 02/13/2009.
- Program committee member. Workshop on 3D Integration and Interconnect-Centric Architectures. Held in conjunction with International Symposium on High-Performance Computer Architecture 2009 (HPCA) Raleigh, North Carolina, February 14-18, 2009.
- Program committee member. The IEEE Computer Society Annual Symposium On VLSI (ISVLSI), Tampa, Florida, USA, on May 13-15, 2009.
- Host of IBM Research Arch PIC Seminar for Professor Yuan Xie, CSE, PSU, 09/26/2008.
3D tutorial, International Symposium on Computer Architecture (ISCA), Beijing, China, 06/212008.
Host of ARL Distinguished Seminar for Professor Narasimha Reddy, ECE, TAMU, 05/01/2008.
Program committee member. International Conference on Networking, Architecture, and Storage (NAS), Chongqing, China. 06/2008.
Program committee member. The 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN), Sydney, Australia. 05/2008.

PATENTS

- More but stopped tracking
- More in 2019
- More in 2018
- More in 2017
- More in 2016
- More in 2015
- More in 2014
- More in 2013
- yet more for Big Blue June, 2012.
- yet a few more for Big Blue March, 2012.
- yet a few more for Big Blue, July 2011.
- 2 more for Big Blue, Oct.-Nov. 2010.
- 5 more filed for Big Blue, May 2010.
- J Li, S P VanderWiel, L Zhang. "On-chip networks for flexible three-dimensional chip integration" United States 8,386,690. Issued February 26, 2013
- J Li and W Speight. "Techniques for dynamically sharing a fabric to facilitate off-chip communication for multiple on-chip units" United States 8,346,988. Issued January 1, 2013
- J Li, W Speight, L Zhang. "Reducing energy consumption of set associative caches by reducing checked ways of the set association" United States 8,341,355. Issued December 25, 2012.
- J Li, L Zhang. "Link services in a communication network". United States Patent 8310936. Issued November 13, 2012.
- J Li, R Rajamony, W Speight, L Zhang. "Read and write aware cache storing cache lines in a read-often portion and a write-often portion" United States Patent 8271729. Issued September 18, 2012.
- J. Li. "Reconfigurable Cache", United States Patent 8,230,176, July 24 2012.
- X. Wu, J. Li, R. Rajamony, W. Speight and L. Zhang, "Improvemed non-uniform cache architecture", United States Patent, file pending, November 2008.
- Q. Yang and J. Li, "Remote online information backup system", United States Patent 7177887, February 2007.

University Collaborations

Dr. Jian Li has been fortunate to work with the following highly-talented graduate students and be in touch with their advisers and research groups.

- Xiaozhu Felix Lin, 2011, now faculty at Purdue University [Lin Zhong, Rice University]
- Cong Liu, summer 2011, now faculty at UT Dallas [James H. Anderson, University of North Carolina]
- Chris J. Craik, summer 2010, now at Google [Onur Mutlu, Carnegie Melon University]
- Doe Hyun Yoon, 2010, now at ARM - [Mattan Erez, UT Austin]
- Hsiang-Yun Cheng, 2009-2010 [Chia-Lin Yang, National Taiwan University]
- Xiaoxia Wu, summer 2008, now at Qualcomm - [Yuan Xie, Penn State University]
- Guangyu Sun, 2008, now faculty at Peking University - [Yuan Xie, Penn State University]
- Yuho Jin, summer 2007, now at USC - [EJ Kim, Texas A&M]

He also holds an adjunct position at the Texas A&M University and collaborate with Professor Lawrence Rauchwerger and other faculty.

Last update: 2021

More recent edits in the following. The opinions and views I express below or anywhere on or related to this web site are mine, and not my current or prior employers', or anyone else's. Trademarks, logos, etc. belong to their corresponding parties. Comments truly welcome!

My Study Notes - Open to All Creatures Great and Small:

。SoC4CG 202408-01: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202407-03: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202407-02: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202407-01: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202406-02: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202406-01: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202405-02: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202404-1: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202404-01: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202403-1: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202402-01: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202401-1: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202312-2: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress) - including predictions of 2024 :)

。SoC4CG 202312-1: https://docs.google.com/presentation/d/31411KqVNlZPwDvtqud2C-g8EOExZBt4-klS0VkKcn17ARA0/edit?usp=sharing (In-progress)

。SoC4CG 202311-2: https://drive.google.com/file/d/1cg-XpSpNndynizNbQwKwT98vqSqiu_Q9/view?usp=sharing (Frozen)

。SoC4CG 202311-1：https://drive.google.com/file/d/1NHrF2-SL4NacDTs-K-S8_WyQ_RwQfNjU/view?usp=sharing （Frozen)

。SoC4CG 202310-2: https://drive.google.com/file/d/1lHnZ3roUX8tUMz_Ive7ZIbXY11za1Rv7/view?usp=sharing (Frozen)

。SoC4CG 202310-1: https://drive.google.com/file/d/1jk5V-zcY6Le_TgKMSXmN98wPW65MgRY2/view?usp=sharing (Frozen)

。Sparks over Coffee for Common Good (SoC4CG), 202309 : https://drive.google.com/file/d/1ffM3e300UUFmXwwYyLWPzxmOy1xYQTGQ/view?usp=sharing (Frozen)

。在Neurips大会看到这个做DL软件加速的公司，https://www.hpc-ai.tech/ 。创始人是新加坡国立大学的尤洋教授和他在伯克利的导师，https://www.comp.nus.edu.sg/~youy/，伯克利毕业的。在前两周的Supercomputing也看到过他们的工作。我个人觉得可以让HPC-AI Tech做个poc，纯软件的，应该见效很快，https://github.com/hpcaitech/ColossalAI

。另外，在Neurips会上发现一个ARM的文章，和MIT 韩松的TinyML训练有共性目标，但只做算法优化，不是软硬件协同。但是，比MCUNet 多做了 bitwidth和sparcity的优化。我也问了韩松这边，他们说因为已经达到优化目标了，所以没有做bitwidth和sparcity的优化。链接：

https://nips.cc/virtual/2022/poster/55251

https://openreview.net/forum?id=ZJe-XahpyBf

。貌似Hinton泰斗和谷歌研究院长兼实干家Jeff Dean，都有类似想法，即，learning based Reconfigurable systems。也许，他们目前都是谷歌的，有一定默契。但这个方向的一些技术点也被MIT等学界研究，论证和支撑了。

。SC会上和Intel的人交流时，他们自己也很骄傲PolarFly这个作品。另外，他们参与了，AGILE项目，应该大概率有落地计划。有一点是: 这种dragonfly变种的拓扑，会对服务器网络布线有一定难度，因为光纤长短不一，等等。但，这都不是新问题。也可控，因为PolarFly数学上很优美，只需要工程人员适应一下即可。Polarfly和HammingMesh都是为了CNN，GNN等DL类新应用的DragonFly变种。Torsten Hoefler的一页SC会议上的胶片基本是这类网络拓扑技术的历史沿革。没想到我原来在IBM做的PERCS文章还是和Torsten Hoefler合作的。他当时在UIUC做我们的IBM BlueWater的售后支持工作。我们是甲乙方关系，写完论文就自然把他的名字放上去了，当时也没留意

。https://groq.com/press/ Groq是个谷歌TPU团队出来的人弄的公司，现在又相当的政府HPC和金融界高频交易客户。他们用的也是变种的DragonFly。 https://www.youtube.com/watch?v=mUsBORr-T8E 这个报告是网络和调度部分。这些工作发表在年中的体系结构顶会ISCA 2022

。这个SC的topo aware network panel我也去听了，确实很实用，因为主要是业界的人，相关胶片的照片在楼上提到的谷歌云盘里：https://drive.google.com/drive/folders/1OhvChtn3olayRqCNYlvbHwO_B8clQTpB

其中，我觉的谷歌的 Brian Towles的报告虽然不好看，但非常切中要害。他在问答环节说的也不多，但字里行间很到位，值得细细品味下。我想原因是这样：他是斯坦福bill daly那里出来的（楼上提到过Bill Daly和Peter Kogge的关于Exascale Computing的总结报告，也很干货），对dragonfly很了解，还和bill daly出了一本网络实践的教科书，在DE Shaw的生物制药超算中心一直做domain specific 网络（这些工作都是比其它人早近10年就看是的工作，很多学界的教授根本没有概念），然后在谷歌的TPU团队也还是做专用网络。可以说，他的理论和实践功底，在panel里是最牛的。但，人很低调。所以，还是要细听的他语句。总体是：基于他的研究和实践经验，domain specific网络是毋庸置疑的，这里包括 topology和相关参数设计和调优。

这块我在panel后问了一下周围的朋友。可能主要是业界的人在上面这个panel上，相关的学界教授做的很多因为资源不足，有价值的论文少些

。这个中山大学和超算中心等的ICS 2022论文，Optimized MPI collective algorithms for dragonfly topology，我觉得可以看作topo aware的一个软件实践。而且，MPI Collectives的网络通讯方式其实很多和DL训练时出现的规律和方式是一样的。这里有个油管报告视频：https://www.youtube.com/watch?v=Gu2Tp-G9LyA 其中一些技术和理念，可以用在HammingMesh，PolarFly等DragonFly变种上

。Jack在SC的图灵奖报告也是一样的逻辑。

或者，换个角度看：深度学习的AI负载，图计算hpc-g的负载，和传统HPC的Linpak（hpc-l）负载是不同的。以传统hpc-l为设计中心的系统，面对ai和g的负载，其效率很低是正常的，一般在1-100甚至到1-1000，主要原因是多线程处理器和内存访问的差距，即，数据搬运的需求差距。

所以，目前很多人是考虑对ai和g负载的系统设计和优化，例如ETH的Onur Mutlu文章里提到的从PIM入手，很大地减少处理器对内存的访问，因为很多计算在内存里或附近做了。加上其它硬件加速器和片上IO的结合，计算离数据会更近。但，这还不够。如ETH的Onur的文章所示，即使是AI的负载，也有不同的需求和相应优化。因此，这就回到您上面提到的DARPA的SDH（sw defined hw）的项目，其实质我认为是，粗粒度的可重构体系结构，其可重构是由软件来指挥的，例如NVDIA的Symphony项目（实际是Steve Keckler和Doug Burger两个人，还在UT Austin做教授时的TRIPS项目的延申）。

其实把这些软件定义和可重构的理念，和云计算放在一起是很完美的。云计算的本质是服务，即用户和编程人员一般不需要知道很多硬件细节，只要好用就行。在好用的前提下，通过高级语言甚至脚本控制下面硬件对不同负载的粗粒度重构，正好满足DARPA的SDH项目的纸面意义上的需求。或者说，云也会是 domain specific clouds，会有类似SDH的DSL和NVDIA Symphony等类似的domain specific architecture（DSA）架构和设计。这里面最难的恐怕是网络架构的设计，及其和内存访问和计算单元的协处理。

。陆奇今天在哈佛和MIT的报告。另有他去年末在中关村的报告。同一个主题。报告本身没有太多意外和项目直相关性。但是，在哈佛和MIT的这两场报告里，他都极力推荐openai 的 co-pilot (和将来的auto-pilot) 和 ChatGPT。其实最近openai的软件包的热度飙升。所以在想，可否应该把openai的软件栈也放入咱们的评测应用集之中? 这方面的应用应该会在AI领域很快很有代表性。

开源可获取行的问题，我想了想：

从溯源来看，OpenAI目前它可以是个Model as a Service（MaaS）的云平台（这一点好像陆奇的一页胶片有涉及）。训练OpenAI的模型，又可以是一个 Training as a Service（TaaS）的云服务。Maas和TaaS都应该可以为我司所用。由于OpenAI的GPT和Codex（Co-Pilot）等model services和API的迅速疯行，和将来的好势头，咱们要储备系统级优化技术，针对OpenAI这种大模型的训练TaaS云服务优化，及其MaaS的云服务优化。这就回到您的忧虑，即OpenAI目前不把API开放给中国，gpt3等也没有开源，那咱们做系统优化和设计的怎么把这块典型市场应用放到咱们的benchmark suite里来？

这确实是个问题。但看看OpenAI的底层技术和沿革，不难发现，OpenAI其实没有算法等方面的重大突破。它的特点是现有算法的大模型和大数据的拓展，及相应的工程实现。这貌似很土，也曾经不被很多学者和业界人士看好；但事实证明解决普适的基本问题，例如语言生成chatgpt，编程辅助codex。回到咱们的诉求，如果各种原因不能直接获得其开源代码，模型等等，可以用其它类似的大公司如NVDIA和微软等做的开源软件替代来“仿真”OpenAI对系统的运行特点，再做针对性的优化。这样，从系统的角度，优化效果大概率是类似的。

因此，针对：

1. Training as a Service，这可以是并行计算parallel computing的典型云服务范例，即您说的AI大模型的训练的开源可获取性：如上所述，因为OpenAI没有算法突破，咱们可以以NVDIA的大模型Megatron @ https://github.com/NVIDIA/Megatron-LM 为出发点。原因是NVDIA是主要是系统公司，他们要卖系统赚钱，所以他们的开源是真心实意的。进而，更大的模型训练开源项目还有微软的基于Megatron的DeepSpeed @ https://www.microsoft.com/en-us/research/project/deepspeed/ 和 https://github.com/microsoft/DeepSpeed 。另外，欧洲那边的Bloom @ https://huggingface.co/blog/bloom-megatron-deepspeed 是在Megatron和DeepSpeed的基础上的更大的模型训练，甚至都比GPT-3的训练模型还大，所以都有计算内核或算子的代表性。可以根据咱们的条件适当取舍。

2. Model as a Service，这可以是多应用计算multiprgrammed workload的典型云服务范例，即chatgpt，codex等。这些模型的基本computing kernels应该和网上能搜到的开源模型，从计算机系统的角度没有太大区别，即一堆其它类似开源模型放在一起跑，咱们来优化，应该和chatgpt，codex等OpenAI模型放在一起跑对计算机系统的要求类似。这里面，单独线程差异有多大，说实话我还没有很大把握，但多线程组合起来，对multiprogrammed 负载分析，应该还是有统计意义的。

。单模型训练成本降低的趋势是对的，但整体TaaS的市场会因为训练成本门槛低了以后催生更多的市场需求而继续发展，单一成本 X 数量 = 整体市场。类似PC的盛行，基因测序的盛行，有点像云计算的IaaS。在IaaS之上，还会有MaaS的发展，就像PaaS。然后带动相应的应用生态，就像SaaS。希望很多大厂能很快出自己的gpt和openai平台，和openAI竞争降成本增应用，这样才能更快地催生这个市场。另外从整体成本的角度，训练成本是一个环节，（1）之前的数据清洗（貌似目前主要需要人工），（2）和之中的数据标注（需要专业know-how，有部分自动化，如Tesla的AD视频标注），这两块如果也都自动化和规模化，也是一部分市场需求我觉得，但可以一步一步来

。陆奇举例子时说，微软非常重视邮件软件Outlook和它的Teams交互软件的业务，这就是从微软角度您提到的泛链接，有了office的流量才能保证微软的产品生态
。Gates, who cofounded Microsoft in 1975, believes that new robots like ChatGPT are capable of training, improving, and reading and writing through new knowledge（知识）. He said that AI will improve messaging software like Microsoft's Team. URL： https://www.thestreet.com/technology/bill-gates-reveals-the-next-big-thing?puc=yahoo&cm_ven=YAHOO

。周一在三藩的ISSCC听了AMD的Lisa Su的主题报告。油管上的视频也出来了：https://www.youtube.com/watch?v=DxAL7MGiWGs 。我个人最喜欢后面这几页，即：基于AI性能的加速提升，可以考虑用 AI计算来替代传统HPC领域的物理建模迭代计算（AI Surogates Physics Models），从而用AI来提高HPC性能。这里和系统领域的 run-ahead execution 和 value prediction 等技术有关联性。即，用速度快的AI模型的inference结果来逼近传统物理建模的计算结果。但，Lisa没有提到的是如果两者之间差别加大，那怎么办？这一点，我们之前有个专利有解决方案，即把 AI计算做为传统HPC计算的预测或（近似计算）加速器的话，如何让两者协调同时获得高性能和可靠结果。这个专利是2021年递交申请给专利局的，已经是公开可获取的文件

。ISSCC会上看到这个Graphcore的报告，觉得不错。个人认为用它来做大模型训练和推理应用的性能预估还是挺靠谱的。这样，可以和仿真系统相互验证。下载地址：https://submissions.mirasmart.com/ISSCC2023/Itinerary/EventDetail.aspx?evt=58

ISSCC的注册信息，可以看到其它所有文档。另外，会场问Graph core的人，他们的用户开始是对 training 需求大，然后过度到 inference 需求大，但客户希望用同一个系统架构可以同时支持training和inference的优化需求。目前，他们95%的需求可以用他们graphcore的on-chip SRAM支撑，不用off-chip memory。但是，他们也说，DL变化太快，需要系统设计时有一定超前性。

。2022年HotChip的Dojo和其它的相关报告： https://hotchips.org/advance-program/

Machine Learning

Groq Software-Defined Scale-out Tensor Streaming Multi-Processor

Dennis Abts, Groq

Boqueria - Next Generation At-Memory Inference Acceleration Device with 1,000+ RISC-V cores

Robert Beachler, Untether AI

DOJO: The Microarchitecture of Tesla’s Exa-Scale Computer

Emil Talpes, Tesla : https:/ /youtu.be/ZL2aD4fKCS4

DOJO - Super-Compute System Scaling for ML Training

Bill Chang, Tesla : https://youtu.be/MWQNjyEULDE?t=5474

Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learningv

Sean Lie, Cerebras
Keynote #2 : https://youtu.be/ZL2aD4fKCS4

f Beyond Compute - Enabling AI through System Integration

Ganesh Venkataramanan, Tesla Motors
https://hc34.hotchips.org/

。https://sharegpt.com/

https://news.ycombinator.com/item?id=34954604

不是所有人都认同这个sharegpt，但也许可以成为一个弯道超车的办法，因为sharegpt可以是建立在chatgpt的label基础上的一个小数据小模型平台。https://sharegpt.com/explore https://sharegpt.com/c/dds1LKN

。