Title: A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions
Associate Professor, University of Cambridge, UK
Bio
Tobias Grosser is an Associate Professor at the University of Cambridge. Before that, he worked as a Reader at the University of Edinburgh, as an Ambizione Fellow at ETH Zurich, and as a Google PhD Fellow at INRIA/Paris IV/ENS Paris. Tobias and his research group have a decade-long history of contributing to the LLVM ecosystem. Tobias developed polyhedral loop optimizations in LLVM/Polly, worked on hardware design with LLHD/CIRCT, and developed xDSL, a Python-native MLIR-style compiler framework, many of which he has used to target low-power AI accelerators. In recent years, he has begun exploring formal methods in the context of the Lean interactive theorem prover (ITP).
Abstract
High-performance micro-kernels must fully exploit today's diverse and specialized hardware to deliver peak performance to deep neural networks (DNNs). While higher-level optimizations for DNNs are offered by numerous compilers (e.g., MLIR, TVM, OpenXLA), performance-critical micro-kernels are left to specialized code generators or handwritten assembly. Even though widely adopted compilers (e.g., LLVM, GCC) offer tuned backends, their CPU-focused input abstraction, unstructured intermediate representation (IR), and general-purpose best-effort design inhibit tailored code generation for innovative hardware. We think it is time to widen the classical hourglass backend and embrace progressive lowering across a diverse set of structured abstractions to bring domain-specific code generation to compiler backends. We demonstrate this concept by implementing a custom backend for a RISC-V-based accelerator with hardware loops and streaming registers, leveraging knowledge about the hardware at levels of abstraction that match its custom instruction set architecture (ISA). We use incremental register allocation over structured IRs, while dropping classical spilling heuristics, and show up to 90% floating-point unit (FPU) utilization across key DNN kernels. By breaking the backend hourglass model, we reopen the path from domain-specific abstractions to specialized hardware and kick off a new generation of open compiler backends in the LLVM ecosystem.
Title: Compilers as Guardians: Reliability and Security in Intermittent Computing
Associate Professor, Purdue University, USA
Bio
Changhee Jung is a Samuel D. Conte Associate Professor of Computer Science at Purdue University. He received his PhD degree in Computer Science from Georgia Tech in 2013. His research interests are in compilers and computer architecture, with an emphasis on performance, reliability, and security. His work has appeared in top conferences such as MICRO, ISCA, ASPLOS, PLDI, OSDI, and RTSS. He received the NSF CAREER Award, AMD/Google Faculty Research Awards, and the Silver Prize in the SAMSUNG HumanTech Thesis Competition. Recently, he was inducted into the MICRO Hall of Fame. Currently, he is serving as an Associate Editor for ACM Transactions on Computer Systems (TOCS).
Abstract
In this talk, I will present three of my research projects in intermittent computing: RockClimb (and its extension), GECKO, and Caphammer. RockClimb enables stagnation-free intermittent execution through power failure immunity---a must-have property for achieving reliability and high performance in intermittently powered systems. In contrast, GECKO and Caphammer address security vulnerabilities that can lead to denial-of-service attacks and incorrect recovery from power outages. At the end of the talk, I will also briefly cover additional critical challenges in intermittent computing and discuss how they can be addressed using lightweight yet effective solutions.
Title: Advancing On-Device Training through Compiler Technology
Bio
Dr. BY Shen is a Senior Technical Manager in the Computing and Artificial Intelligence Technology Group at MediaTek, with over a decade of industry experience in compiler design and optimization. Since 2017, he has led the development of the NeuroPilot AI compiler infrastructure, which is widely used across MediaTek products to enable Edge AI capabilities. Dr. Shen holds a Ph.D. in Computer Science from National Chiao Tung University, Taiwan.
Abstract
On-device training for edge devices presents unique optimization challenges, particularly in storage and memory footprint. In 2024, MediaTek became the world's first company to enable on-device LoRA fine-tuning, successfully bringing this technology to commercial products in 2025. This talk explores how MediaTek overcomes these challenges using advanced compiler optimization techniques, with a focus on strategies for reducing storage requirements and memory usage. Practical solutions implemented in the NeuroPilot AI compiler are highlighted, along with real-world deployment experiences that demonstrate how efficient and scalable on-device training can be achieved for edge AI applications.
Title 1: RISC-V position and real Andes success story in LLM AI
Title 2: Enhancing Compiler Optimization with a Cycle-Accurate Simulator
Title 3: AI-Assisted Exponential Function Approximation for LLM Workloads in the Andes ACE Workflow
Bio
王庭昭 is a Senior Technical Marketing Manager at Andes Technology. Before that, he served as a software architect at MediaTek and Moxa. He holds bachelor's and master's degrees in Electrical and Control Engineering from National Yang Ming Chiao Tung University.
Abstract
In this presentation, we will dive into how RISC-V is shaking up the AI world by letting developers build exactly what they need. Most importantly, I'm sharing the real-world success story of Andes Technology. We'll look at how their RISC-V cores are actually powering LLMs today, proving that the RISC-V architecture is flexible, extensible, and powerful for the next generation of AI.
Bio
He previously worked at MediaTek on camera 3A/ISP software development, gaining hands-on experience in image processing and system software. He is now at Andes Technology, focusing on SystemC TLM system modeling and toolchain development, working on software and system-level technologies related to processor architecture.
Abstract
Compiler optimization often relies on LLVM scheduler models and llvm-mca for pipeline and latency analysis. However, when microarchitectural behavior cannot be fully abstracted (such as multiple execution stages or ambiguous corner-case latencies), static models can diverge from real hardware performance. This talk demonstrates how a cycle-accurate simulator can be integrated into the compiler optimization workflow as a practical analysis tool. Three case studies are presented: (1) when hardware implements separate EX/LX stages but LLVM models only a single execution resource, simulation helps analyze stage contention and guide scheduler tuning; (2) when reduction instruction latencies are unclear in the hardware specification, simulation enables more accurate scheduling decisions; and (3) in an AI kernel acceleration case, simulation exposes the true bottleneck, motivates the introduction of a custom instruction, and verifies the resulting performance gain. The methodology can be combined with LLVM LNT and automated workflows to accelerate optimization iterations. Compared to costly and time-consuming FPGA validation, a pure-software cycle-accurate simulator enables lower-cost, faster experimentation in early development stages. Ultimately, a cycle-accurate simulator bridges the gap between microarchitecture and compiler optimization. By grounding compiler decisions in real hardware behavior, developers can achieve greater accuracy and faster iteration cycles, ensuring that software is truly tuned for the silicon it runs on.
Bio
He holds a PhD from National Chiao Tung University and previously worked at Faraday Technology, Global Unichip, and Novatek on research and development in multimedia compression, signal processing, and computer vision. He is now Deputy Director of the Computing Acceleration R&D Division at Andes Technology, focusing on AI-related technologies and application development.
Abstract
The exponential function is a key computational component in large language model (LLM) workloads, particularly in softmax and attention operations. Efficient and flexible implementations are critical for meeting diverse accuracy, latency, and hardware requirements in AI systems. This work explores multiple approximation techniques for the exponential function, including polynomial and table-based methods, and evaluates their trade-offs in numerical accuracy and computational cost. We introduce an AI-assisted development flow that generates both C reference code and synthesizable Verilog RTL from mathematical specifications. The generated designs are integrated into the Andes ACE workflow for system-level verification. This study demonstrates a practical methodology that can be adapted to different design objectives, allowing users to guide and customize parts of the workflow. This approach highlights how AI-assisted generation can accelerate function development, reduce implementation effort, and support integration across diverse software and hardware applications.
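As a purely illustrative sketch (not material from the talk), one common form of the polynomial approach mentioned above combines range reduction with a low-degree polynomial; the function name and the degree-4 Taylor polynomial below are this editor's assumptions, chosen only to show the general shape of such a kernel:

```python
import math

def exp_approx(x: float) -> float:
    """Illustrative sketch: approximate e^x via range reduction
    plus a degree-4 polynomial (not the talk's actual method).

    Split x = n*ln(2) + r with r in [-ln(2)/2, ln(2)/2], approximate
    e^r with a truncated Taylor polynomial, then scale by 2^n.
    """
    n = round(x / math.log(2))   # integer multiple of ln(2)
    r = x - n * math.log(2)      # reduced argument, |r| <= ln(2)/2
    # Degree-4 Taylor polynomial for e^r, evaluated in Horner form
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0))))
    return math.ldexp(p, n)      # p * 2**n
```

Table-based variants trade the polynomial evaluation for a small lookup table indexed by high bits of the reduced argument, shifting cost from arithmetic to memory, which is one of the accuracy/cost trade-offs the talk evaluates.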
Title 1: The Gold Standard of RISC-V
Title 2: Beyond Fixed Widths: Embracing Scalable Vectorization in Modern Compilers
Site Lead and Senior Director of Software Engineering, SiFive Taiwan
Bio
Peter Liao leads the SiFive Taiwan CPU design team, developing world-leading RISC-V CPU IP solutions and continuously pushing the performance of chip hardware and software design. Peter has accumulated more than 20 years of experience in CPU- and AI-related hardware/software performance optimization and R&D verification, with core expertise in performance optimization and system verification. He previously worked at chip design companies including Marvell, Faraday, and BITMAIN. Peter graduated from National Taiwan University with a master's degree in Computer Science and Information Engineering.
Abstract
This session explores RISC-V's evolution from a university project to the global open-standard ISA. We examine how its flexibility overcomes the limitations of proprietary architectures, driving innovation in AI and custom silicon. As the industry leader, SiFive showcases its high-performance IP portfolio, empowering the next generation of computing.
Senior Staff Software Compiler Engineer, SiFive Taiwan
Bio
Shih-Po Hung is a compiler engineer at SiFive, where he blends his interests in computer architecture and performance optimization. He is currently focused on integrating RISC-V extensions into loop vectorization, bridging advanced hardware capabilities with modern compiler infrastructure to deliver efficient, scalable software for the RISC-V ecosystem.
Abstract
The RISC-V Vector (V) extension exemplifies a fundamental shift away from fixed-width SIMD toward scalable, length-agnostic vector execution. This talk explores how the LLVM loop vectorizer can exploit the V extension by rethinking traditional loop transformation and vectorization strategies. Topics include vector-length-agnostic loop shaping, masking and tail handling, and the implications for portability across microarchitectures with different maximum vector widths.
Title: Democratizing Generative AI for Everyone
Bio
With 8 years of experience in the NAND Flash storage industry, Gary specializes in firmware algorithm design for NAND Flash controllers. He currently leads the aiDAPTIV development team in designing and developing the aiDAPTIVLink technology, which enables on-premises AI solutions.
Abstract
What is Phison doing?
How is Phison related to AI?
What is aiDAPTIV+?
How can aiDAPTIV+ be integrated into your life?