Workshop Schedule

08:30 - 08:35

Welcome & Intro

08:35 - 09:00

Representation Learning for Computation Graphs [Slides]

Zhiru Zhang - Cornell University

Abstract: Graph representation learning has garnered significant interest recently for encoding graph topology into compact representations, preserving crucial structural information for downstream tasks. In this talk, I'll discuss our latest research on accurate and scalable representation learning for computation graphs, which are playing an ever-growing role in many EDA and compilation problems. I'll introduce HOGA, a hop-wise graph attention approach for circuit representation learning, emphasizing its efficiency in distributed training and generalizability to new circuit designs. Additionally, I'll briefly discuss Polynormer, a novel graph transformer with linear time complexity. Our experiments show promising results across diverse graph benchmarks, including Google TpuGraphs for predicting AI model runtime under various compiler configurations.

09:00 - 09:25

Is it possible to build a multi-FPGA LLM? [Slides]

Paul Chow - University of Toronto

Abstract: There are many examples of machine learning applications implemented on a single FPGA or a few FPGAs. At this scale, implementation can still be treated as the construction of a hardware circuit: design the accelerator, add I/O, implement, test. For large-scale applications, such as large language models (LLMs) that already use hundreds or thousands of GPUs, a similar number of FPGAs would be required to implement the same application. It is not feasible to scale the "hardware" design approach to such large systems. The software world has already faced this challenge and developed approaches for building large-scale distributed applications. If we are ever to implement LLMs with many FPGAs, we must learn from software and develop similar technology: a way to describe the application so that it can be mapped to multiple FPGAs, an abstraction that enables a clear description of the application, and a platform for easy deployment onto multiple FPGAs. All of this should be done in a way that hides the fact that the implementation is even on FPGAs. Application development for multi-FPGA systems must be more like distributed programming than building a large number of connected hardware circuits. In this presentation, we describe our first efforts to address the challenge of building large-scale LLMs on FPGAs. We build a proof-of-concept, multi-FPGA implementation of a small transformer, keeping in mind the goal of scaling to thousands of network-connected FPGAs. Starting from the Galapagos system for deploying multi-FPGA applications, we extend it to scale beyond its current limits, add collective communications, develop a tool for building clusters of Galapagos clusters from a higher-level network description, and then implement a small transformer using multiple small FPGAs. This serves as a proxy for building larger transformers with larger FPGAs.
Based on this working implementation, we estimate how the system will perform with larger and more modern FPGAs. We conclude that our approach shows promise for deploying large-scale applications on multiple FPGAs and that FPGAs can be competitive with GPU performance. We also expect a significant power advantage, but demonstrating that is our next step.

09:25 - 09:50

Dynamic DNNs and Runtime Resource Adaptation for Efficient On-Device Inference [Slides]

Lei Xun - University of Southampton

Abstract: The landscape of on-device inference is increasingly being shaped by advancements in FPGA technologies for various DNN model architectures. Dynamic DNNs, as explored in recent FPGA research, offer promising solutions by strategically deactivating parts of the model, thus achieving faster execution and reduced computational intensity. However, at system runtime, multiple applications typically execute concurrently and compete for hardware resources. This raises two main challenges, runtime hardware availability and runtime application variability, both of which motivate the demand for more versatile systems. In this presentation, I will explore how we address these challenges through a system-level runtime solution for managing DNN performance trade-offs, combining the runtime trade-off opportunities in both models and hardware to meet dynamically changing application performance targets and hardware constraints in real time. I will also discuss how we co-designed dynamic ConvNets and Transformers for embedded CPUs, GPUs, and accelerators, and highlight the demand for more versatile hardware, steering our future explorations towards platforms like compute-in-memory and reconfigurable hardware.

09:50 - 10:30

Industry Keynote: The Architecture and Programming Model for the Groq TSP/LPU

Satnam Singh - Groq

Abstract: This talk gives an overview of the Groq TSP/LPU architecture along with a flavor of its programming model. The talk illustrates how machine learning models from mainstream platforms like PyTorch and TensorFlow are mapped to a spatial, statically scheduled architecture, as well as how domain-specific languages can be used to transform more general linear algebra computations for efficient execution on Groq's TSP/LPU chips.

Coffee Break (10:30 - 11:00)

11:00 - 11:25

FPGA Architecture for Deep Learning [Slides]

Vaughn Betz - University of Toronto

Abstract: FPGAs have unique capabilities that make them an attractive platform for accelerating deep learning (DL) inference. They offer the ability to customize processing pipelines and thus achieve higher efficiency compared to general-purpose CPUs and GPUs, at a fraction of the development time and cost of specialized ASICs. Their diverse IOs also enable direct interfacing to the network and a variety of sensors/actuators, making them suitable for both datacenter and edge use cases. With DL inference becoming a major market segment, FPGA architecture is evolving to match its requirements. This talk will give an overview of DL-targeted architecture enhancements to existing FPGA components (e.g., logic blocks, DSPs, BRAMs) as well as newly introduced tensor compute blocks. It will also highlight promising directions for future research on beyond-FPGA reconfigurable devices that combine conventional FPGA fabrics with coarse-grained DL accelerator cores and NoCs for efficient system-level communication.

11:25 - 11:50

Robust GNN-based Representation Learning for HLS [Slides]

Atefeh Sohrabizadeh - UCLA

Abstract: The long evaluation runtime of each design candidate makes the efficient and timely optimization of microarchitecture for a target application a serious burden. To tackle this problem, researchers have started using learning algorithms such as graph neural networks (GNNs) to accelerate the process by developing a surrogate of the target tool. However, challenges arise when developing such models for HLS tools due to the program's long dependency range and the deep coupling between the input program and its transformations (i.e., pragmas). To address them, we present HARP (Hierarchical Augmentation for Representation with Pragma optimization), which builds a novel hierarchical graph representation of the HLS design by introducing auxiliary nodes that capture high-level hierarchical information about the design. Additionally, HARP decouples the representation of the program from that of its transformations and includes a neural pragma transformer (NPT) approach to facilitate a more systematic treatment of this process. The graph representation and model architecture of HARP not only enhance the performance of the model and the design space exploration based on it, but also improve the model's transfer learning capability, enabling easier adaptation to new environments.

11:50 - 12:15

FPGAConvNet: Automated Design of Convolutional Neural Network Accelerators for FPGAs [Slides]

Alex Montgomerie - Imperial College London

Abstract: FPGA devices are a promising platform for deploying Convolutional Neural Network (CNN) models in a wide range of settings, from embedded systems to data centers. The fine-grained configurability of FPGA devices affords them the ability to deploy highly customized accelerator designs. However, discovering the optimal accelerator design for a specific CNN workload and objective is challenging given the vastness and complexity of the design space. This talk discusses the FPGAConvNet framework, which makes use of streaming architecture components and automated design space exploration techniques to solve the optimization problem of mapping CNN models to FPGAs. The talk will delve into the details of this framework and how it has been used to discover high-performance designs for applications such as image classification, object detection, and TinyML.

Lunch Break (12:15 - 13:45)

13:45 - 14:25

Industry Keynote: AIR - Spatial Programming for Ryzen AI and Beyond

Samuel Bayliss - AMD

Abstract: Some users seek close-to-the-metal abstractions that give timely access to the latest accelerator hardware, while others seek to program in frameworks and domain-specific languages that promise longevity and portability across many hardware generations. AIR is a programming abstraction that exposes the characteristics of a broad set of spatial architectures, aiming to map efficiently to a wide range of hardware and to act as a target for many different application domains. We show how it can be used in an application stack that maps AI applications from PyTorch through IREE to implementation on AMD Ryzen AI accelerators.

Short Break & Panel Setup (14:25 - 14:30)

14:30 - 15:30

Panel Discussion: Opportunities & Challenges for Working Across Abstraction Layers for DL

Panelists: Jason Anderson (University of Toronto) - Jason Cong (UCLA) - Stephen Neuendorffer (AMD) - Theo Drane (Intel) - Wayne Luk (Imperial College London)

Abstract: Improving the performance of deep learning applications requires extensive cross-stack optimizations, ranging from model structure and compiler optimizations to accelerator architecture, arithmetic circuitry, and physical design implementation. Focusing on a specific layer of this stack limits the multiplicative gains we can realize through a more vertical optimization approach. On the other hand, working across different abstraction layers is very challenging as it significantly expands the design space to be explored and often requires expertise in a variety of domains. In this panel, we will discuss different views on the most promising opportunities for cross-stack optimization in the DL domain and the biggest challenges we need to solve as a community to unlock the full potential of spatial DL acceleration.

15:30 - 15:45

Workshop Closing