Program

20 minutes for each oral presentation
(~17 minutes for the talk and 3 minutes for Q&A)

Day 1 (Thursday, May 25, 2023)

Day 2 (Friday, May 26, 2023)

Keynote speech

Topic

Good Old Trees along with Emerging Memories

Prof. Kuan-Hsun Chen (陳冠旬 教授)
Department of Computer Science, University of Twente

Abstract

In the era of deep learning, tree-based machine learning models remain popular in critical domains that require explainability. However, several well-known inference libraries still have room for improvement, and surprisingly few automation tools for decision tree ensembles existed until very recently. On the other hand, emerging non-volatile memories have been widely studied as alternatives to DRAM and as building blocks for in-memory computing units, which brings disruptive hardware constraints that call for novel maintenance strategies in system software. This keynote offers insights into hardware-aware mapping strategies that improve efficiency and memory lifetime for decision trees and emerging memories, as well as their interplay. The talk concludes with future directions to follow up on from the perspective of compiler techniques and system software.
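For a concrete feel of the layout decisions such mapping strategies deal with, the sketch below stores a decision tree as a flat array of nodes so that inference walks contiguous memory, the kind of structure a hardware-aware mapper could place in DRAM or an emerging non-volatile memory. This is our minimal illustration, not the speaker's code; all names are made up.

#include <stdio.h>

/* Each tree node sits in one contiguous array slot; feature == -1
   marks a leaf. A flat layout like this is what a hardware-aware
   mapper can place and traverse efficiently in DRAM or NVM. */
typedef struct {
    int   feature;   /* feature index to test, or -1 for a leaf */
    float threshold; /* go left if x[feature] <= threshold      */
    int   left;      /* array index of the left child           */
    int   right;     /* array index of the right child          */
    float value;     /* prediction stored at a leaf             */
} Node;

/* Iterative inference: follow child indices until a leaf is reached. */
static float predict(const Node *tree, const float *x) {
    int i = 0;
    while (tree[i].feature >= 0)
        i = (x[tree[i].feature] <= tree[i].threshold)
                ? tree[i].left : tree[i].right;
    return tree[i].value;
}

int main(void) {
    /* A tiny hand-built tree: the root tests feature 0 against 0.5. */
    const Node tree[] = {
        { 0, 0.5f, 1, 2, 0.0f},  /* root (internal node) */
        {-1, 0.0f, 0, 0, 1.0f},  /* leaf: predict 1.0    */
        {-1, 0.0f, 0, 0, 2.0f},  /* leaf: predict 2.0    */
    };
    const float x[1] = {0.3f};
    printf("prediction: %.1f\n", predict(tree, x));
    return 0;
}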

Invited talk I (Andes)

Topic

Code Size Optimizations in RISC-V Toolchain

Shao-Chung Wang (王紹仲)

Manager, Software Division, Andes Technology

Chih-Mao Chen (陳枝懋)

Toolchain engineer, Andes Technology

Abstract

Code size has long been a major challenge in embedded systems. The RISC-V foundation designed the code size reduction extension, Zc, to improve code density, and many compiler optimizations for code size reduction have also been proposed. In this talk, we introduce code size reduction technologies from the source code level through the compilation flow. At the source level, we present schemes that use pragmas, attributes, and built-in functions to help reduce code size. For the compilation process, we first introduce the useful flags for tuning code size, then describe methods to shrink library code and the compiler algorithms that generate the RISC-V Zc instructions. Third, several linker optimization technologies for code size reduction will also be introduced. Finally, experiments will show the total code size reduction from these optimizations.
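For a flavor of the source-level schemes the talk surveys, the hedged sketch below shows common GCC-style attributes that influence code size; the specific pragmas, attributes, and built-ins Andes presents are their own and may differ.

#include <stdint.h>

/* Per-function size tuning with a GCC-style attribute, so only hot
   functions keep aggressive speed optimizations. */
__attribute__((optimize("Os")))
uint32_t checksum(const uint8_t *p, uint32_t n) {
    uint32_t s = 0;
    while (n--)
        s += *p++;
    return s;
}

/* Marking a rarely executed handler cold and out of line keeps it
   from bloating hot code paths. */
__attribute__((cold, noinline))
void fatal_error(int code) {
    (void)code;
    for (;;)
        ;  /* halt */
}

On the flag side, -Os (or Clang's -Oz), -ffunction-sections and -fdata-sections paired with -Wl,--gc-sections, and link-time optimization via -flto are the sort of standard knobs such a survey typically covers.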

Invited talk II (SiFive)

Topic

The Journey of RISC-V Vector Extension

Yi-Hsiu Hsu (許益修)

Director, SiFive Taiwan

Kito Cheng (程皇嘉)

Senior Staff Engineer, SiFive Taiwan

Abstract

The Vector Extension has been one of the most important instruction set architecture extensions defined by the RISC-V foundation in recent years. It can substantially enhance product design performance and outcomes and applies to many fields, such as AI, computer vision, and multimedia signal processing, making it an indispensable part of the RISC-V CPU ecosystem. Besides the many significant hardware-related innovations in the market, software innovation is also in full swing. In this talk, the speakers will give an overview of the vector extension and address how SiFive optimizes programs with RVV, designs and implements RVV, migrates code to RISC-V Vector with SiFive Recode, and how auto-vectorization performs at SiFive. The talk will also share more about SiFive's innovations currently in progress built on the Vector Extension.
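As a taste of what programming with RVV looks like, here is a minimal saxpy kernel in the textbook style, written with the RVV C intrinsics (v1.0 naming with the __riscv_ prefix); this is a generic illustration, not SiFive's code. With a vector-enabled target such as -march=rv64gcv, recent GCC and Clang can also auto-vectorize the equivalent scalar loop.

#include <riscv_vector.h>
#include <stddef.h>

/* y[i] += a * x[i], strip-mined by the hardware vector length:
   vsetvl decides how many elements each pass processes, so the
   same binary adapts to any implementation's vector register size. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
        vl = __riscv_vsetvl_e32m8(n);                    /* elements this pass */
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);  /* load x             */
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);  /* load y             */
        vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);     /* y += a * x         */
        __riscv_vse32_v_f32m8(y, vy, vl);                /* store y            */
    }
}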

Invited talk III (Skymizer)

Topic

Break the Compilation Boundaries in Hyperscale Accelerators with ONNC

Luba Tang (唐文力)
CEO and Founder, Skymizer Inc. 

Abstract

The demand for hyper-scale neural network models has driven the development of highly efficient and heterogeneous architectures, such as accelerators for DLRM and GPT-4. However, achieving maximum performance while maintaining low power consumption remains a significant challenge. To address this challenge, this talk explores using two intermediate representations (IRs) in Skymizer's Open Neural Network Compiler (ONNC) to optimize the performance and power consumption of heterogeneous architectures. The Work Level IR abstracts the hardware resource hierarchy to optimize data exchange and management among processing elements. In contrast, the Time-Space IR abstracts the synchronization approaches to reduce overhead and latency. Compiler optimizations, including graph partitioning, modulo scheduling on processing elements, and delayed hot patching at runtime, utilize these IRs to achieve maximum parallelism, utilization, and performance. Experimental results demonstrate significant improvements in the parallelism and utilization of a hyper-scale model, DLRM, on a heterogeneous multicore accelerator, Neuchips' N3000, which contains a complicated bus and DMA system with more than 22 diverse recommendation-system-specific acceleration engines. Utilizing these IRs and compiler optimizations increased the degree of parallelism from 8 to 16, yielding an 8.5× improvement in utilization.
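To make the modulo-scheduling idea concrete, the toy sketch below assigns operators of a partitioned graph to processing elements in modulo fashion, so PE k executes operators k, k+N, k+2N, ... in a software-pipelined steady state. This is purely our illustration; ONNC's actual IRs, partitioner, and scheduler are far richer, and all operator names here are invented.

#include <stdio.h>

#define NUM_PES 4  /* illustrative number of processing elements */

typedef struct { int id; const char *name; } Op;

int main(void) {
    /* A toy partitioned operator graph (names are made up). */
    Op ops[] = {{0, "embed"},  {1, "gather"}, {2, "matmul"}, {3, "relu"},
                {4, "matmul"}, {5, "concat"}, {6, "matmul"}, {7, "sigmoid"}};
    int n = (int)(sizeof ops / sizeof ops[0]);

    /* Modulo assignment: operator i runs on PE (i % NUM_PES) in time
       slot (i / NUM_PES), giving a steady state in which all PEs stay
       busy once the pipeline fills. */
    for (int i = 0; i < n; i++)
        printf("op %d (%-7s) -> PE %d, slot %d\n",
               ops[i].id, ops[i].name, i % NUM_PES, i / NUM_PES);
    return 0;
}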