Emerging HPC and data centre applications, in domains such as artificial intelligence (AI) based on large language models (LLMs) and transformer architectures, data analytics, scientific computing, and enterprise computing, are experiencing rapid growth in both the volume of data to be processed and the complexity of the algorithms applied to it. To meet these growing demands, industry and academia are increasingly exploring heterogeneous architectures beyond traditional CPUs and GPUs.
This includes FPGAs, specialized accelerators (xPUs), and an emerging class of architectures based on near-memory and in-memory computing (IMC) concepts, as well as other non-conventional compute paradigms. Many of these innovations are finding their way into commercial products, particularly for inference workloads where energy efficiency is paramount.
The main goal of this workshop is to better understand current and future challenges in achieving resource and energy efficiency for LLMs and AI-centric workloads. We aim to foster discussion on hardware and software co-design across diverse application domains - from data centres to the edge and HPC - while facilitating collaboration between academia and industry. The workshop will include technical presentations that develop a complete view of the ecosystem, from software to hardware, on which the next generation of HPC and data centre systems can be built.
The explosive growth of AI workloads, especially LLMs, is driving a shift toward compute architectures that optimize for energy and memory efficiency rather than raw FLOPS alone.
Advances in chiplet-based design, 2.5D/3D packaging, and memory-centric compute models open new frontiers in architectural specialization.
Inference at scale, particularly in edge and low-power settings, motivates exploration of in-memory and near-memory computing (IMC/NMC) technologies (e.g., SRAM-based compute arrays, HBM with integrated logic, compute-in-flash).
There is a pressing need to align hardware innovations with the evolving software stacks that support AI, scientific computing, and data-intensive applications.
This workshop will offer a forum for discussing the advancements and challenges in resource- and energy-efficient compute architectures for LLMs, transformers, and related workloads. It aims to:
Explore how different architectural paradigms - FPGAs, xPUs, near- and in-memory computing, and other emerging models - contribute to efficiency gains.
Address both hardware and software challenges, including programming models, toolchains, and compiler support for these architectures.
Examine opportunities to specialize hardware/software solutions by application domain (e.g., edge AI, data centres, HPC) or vertical markets (e.g., automotive, personalized medicine, industrial AI).
Reduce complexity barriers that hinder the wider adoption of unconventional architectures.
Topics of interest for this workshop include, but are not limited to, the following:
Energy- and resource-efficient architectures for LLMs, transformers, and AI inference/training workloads.
Advances in near-memory and in-memory computing (digital and analog approaches, across memory technologies including SRAM, DRAM, RRAM, PCM, etc.).
FPGA/xPU and other reconfigurable or specialized accelerators for AI and HPC.
Architectural co-design for performance optimization and energy reduction.
Hardware-software co-design: programming models, toolchains, compiler flows.
Case studies on domain specialization (edge AI, data centre, HPC, or specific verticals like healthcare, automotive).
System-level design for composable, heterogeneous infrastructures (including chiplets, 3D integration, and disaggregated compute).
Evaluation methods for energy efficiency, memory bottlenecks, and scalability in AI workloads.
10:00-10:20 Co-organizers: Workshop introduction
Session 1: Efficiency for AI workloads
10:20-11:00 Ana-Lucia Varbanescu/University of Twente: Identifying Waste in LLM Processes: An Empirical Approach
Despite the successes of LLMs in recent years, their energy consumption raises questions of sustainability. In this talk, we take a different look at the efficiency of LLM processes on current architectures (a typical CPU-GPU system): we focus on determining whether these processes are wasteful. To this end, we propose an intuitive definition of waste, present an approach to identify and quantify it, and demonstrate the approach in practice on a common case of LLM training. We further discuss promising avenues to reduce waste, and their possible impact on LLM processes in general.
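As a purely illustrative companion to this abstract, the following minimal Python sketch shows one hypothetical way to quantify waste, assuming a definition in which energy drawn while GPU utilization sits below a threshold counts as wasted; the threshold, sampling interval, and trace values are invented for illustration and are not the speaker's actual definition or data.

import numpy as np

def wasted_energy_joules(power_w, utilization, dt_s=1.0, busy_threshold=0.10):
    """Sum the energy of samples whose GPU utilization falls below the threshold."""
    power_w = np.asarray(power_w, dtype=float)
    utilization = np.asarray(utilization, dtype=float)
    idle = utilization < busy_threshold         # samples treated as "wasteful" (assumed definition)
    return float(np.sum(power_w[idle]) * dt_s)  # joules = watts * seconds per idle sample

# Invented 10-sample trace at 1 Hz with two low-utilization gaps.
power = [300, 310, 120, 115, 305, 300, 118, 300, 310, 305]            # watts
util  = [0.90, 0.95, 0.02, 0.01, 0.92, 0.90, 0.03, 0.88, 0.90, 0.93]  # GPU utilization
print(f"wasted energy: {wasted_energy_joules(power, util):.0f} J of {sum(power):.0f} J total")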
11:00-11:30 Coffee break
11:30-12:00 Jiawei Zhuang/Huawei: PTO: Tile-first megakernel programming for AI
Programming modern AI accelerators remains challenging due to complex memory hierarchies, diverse data types and layouts, heterogeneous and concurrent execution units, and manual software pipelining and synchronization. We present PTO (Parallel Tensor/Tile Operation), a framework that streamlines megakernel development through a "tile-first" programming model. PTO provides abstractions compatible across multiple levels, bridging Tensor-, Tile-, and ISA-level operations, and supports both SPMD and MPMD runtime execution. We demonstrate improved productivity and performance by implementing state-of-the-art NPU kernels for LLMs, including DeepSeek Sparse Attention and Gated DeltaNet. The framework is open-sourced at https://gitcode.com/cann/pypto and https://gitcode.com/cann/pto-isa.
12:00-12:30 Christos-Savvas Bouganis/Imperial College London: Boosting Sub-8-Bit Large Language Model Inference Performance
Deploying large-scale LLMs presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant accuracy degradation in LLMs when precision falls below 8 bits. This talk will address this challenge, focusing on our recent effort to push performance further by integrating sub-8-bit quantization with SVD-based iterative low-rank tensor decomposition for error compensation.
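To illustrate the general mechanism (a one-shot NumPy sketch, not the speakers' iterative algorithm; matrix size, bit width, and rank are arbitrary assumptions), the quantization residual W - Q(W) can be approximated by a truncated SVD, and the resulting low-rank term added back to compensate part of the error:

import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric uniform (fixed-point-style) quantization, returned in float."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def low_rank_compensation(w, w_q, rank=16):
    """Rank-r SVD approximation of the quantization residual W - Q(W)."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)) / np.sqrt(512)   # stand-in weight matrix
w_q = fake_quantize(w, bits=4)
w_comp = w_q + low_rank_compensation(w, w_q, rank=16)

print("Frobenius error, quantization only: ", np.linalg.norm(w - w_q))
print("Frobenius error, with compensation: ", np.linalg.norm(w - w_comp))

In schemes of this kind the low-rank factors are typically kept in higher precision and folded into the inference path as an extra low-rank matmul, so the accuracy/overhead trade-off is governed by the chosen rank.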
12:30-13:00 Adrián Rubio/BSC: The Pulse of Generative AI: Visualizing Workload Patterns in High-Performance Clusters
As High-Performance Computing (HPC) infrastructures increasingly support Large Language Model (LLM) workloads, understanding the real-world behavior of these tasks is critical. Unlike traditional scientific simulations, which often exhibit steady computational patterns, Generative AI inference pipelines often present unique, erratic resource consumption signatures. This talk presents a workload characterization study based on execution traces from a Tier-0 HPC cluster. By analyzing hardware telemetry, specifically GPU utilization profiles and memory throughput, the session dissects the performance gap between theoretical hardware capabilities and the reality of custom and mixed-library pipelines developed under strict time-to-solution constraints. The analysis demonstrates how prioritizing rapid deployment over hardware-specific optimization leads to diverse unoptimized execution patterns. We visualize distinct behaviors, such as the 'sawtooth' utilization caused by I/O friction in custom data loops, and high-frequency oscillation periods attributed to overheads in standard stacks. The presentation argues that as GPUs become faster, the bottleneck shifts aggressively towards the storage subsystem and the software integration layers, inviting a discussion on how future architectures can better accommodate the 'bursty' nature of agile AI development.
13:00-14:00 Lunch break
Session 2: Green processing for AI workloads
14:00-14:40 Hadjer Benmeziane/IBM Research: In-Memory Computing as a Foundation for Modern AI Hardware
This talk presents in-memory computing as a foundation for modern AI hardware, motivated by the growing mismatch between emerging model structures and conventional compute-centric architectures. We discuss how IMC-enabled designs support both small language models for edge deployment and large-scale cloud models, including Mixture-of-Experts, highlighting key cross-layer design challenges.
14:40-15:20 Johanna Rock/Tenstorrent: Practical Energy Efficiency for AI Workloads: A Tenstorrent Perspective
Large language model (LLM) workloads are typically constrained not by peak compute, but by energy, memory traffic, and system-level efficiency. In this talk, we frame “green processing for AI workloads” using practical, measurable metrics at production-relevant latency targets. We then break down where energy is spent in real deployments - compute, memory movement, and interconnect - and discuss why reducing data movement and chip-to-chip communication often delivers outsized gains. From a Tenstorrent perspective, we outline an end-to-end approach that combines architectural choices from silicon to model-level optimizations, supported by an open and inspectable software stack. Finally, we share forward-looking efficiency themes - spanning silicon, software, and system operation - that aim to further reduce energy per useful output for LLM inference and training in future HPC and data centre environments.
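As a hedged, back-of-the-envelope illustration of one such metric (the numbers below are invented and are not Tenstorrent figures), "energy per useful output" for LLM inference can be read as joules per generated token at a given latency target:

def joules_per_token(avg_power_w: float, tokens_per_second: float) -> float:
    """Average system power divided by sustained generation throughput."""
    return avg_power_w / tokens_per_second

# Invented example: a 6.5 kW node sustaining 12,000 tokens/s end to end.
print(f"{joules_per_token(6500.0, 12000.0):.3f} J/token")   # ~0.542 J/token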
15:30-16:00 Coffee break
Session 3: Green infrastructure for AI workloads
16:00-16:40 Thilo Werner/SpiNNcloud: Beyond Conventional Accelerators: Brain-Inspired Computing for Extremely Energy-Efficient Large-Scale AI
Large-scale AI systems face fundamental energy constraints that conventional architectures struggle to address. This talk presents SpiNNaker2, a brain-inspired system implementing event-based processing, dynamic sparsity, asynchronous communication, and hierarchical organization across all architectural levels. This talk will examine how these biological computing principles can be applied to both spiking neural networks and dynamically sparse algorithms, therefore demonstrating substantial improvements in energy-delay product for inference and training. We discuss how event-driven computation and distributed memory organization differ from conventional synchronous accelerators, and explore implications for energy-efficient AI deployment in HPC and data centre environments.
16:40-17:20 Jesus Escudero Sahuquillo/Universidad de Castilla-La Mancha: AI Models are Hungry (and Thirsty): Enhancements from the Interconnection Network Perspective
Artificial Intelligence (AI) is driving the current industrial revolution with unprecedented levels of innovation across numerous critical areas, including Smart Cities, Agriculture and Homes, autonomous transport, Industry 4.0, and multimedia entertainment, as well as multiple scientific disciplines that rely on High-Performance Computing (HPC). The performance of the data centers that support the AI applications used in the mentioned areas has increased in line with their demands, thanks to significant improvements in computing and memory devices, as well as innovations in the networks that interconnect these devices. Many AI-based models, especially large language models (LLMs), are trained and deployed on energy-hungry servers in massive data centers whose size has been increasing constantly to reach the required performance, and whose energy consumption has increased proportionally. Also, efforts are underway to address the scaling of AI's water footprint, since many millions of liters of freshwater are consumed for cooling data center servers or for electricity generation. In this context, the interconnection network is a fundamental subsystem in the architecture of data centers and HPC systems, as they currently integrate thousands of computing or storage devices (i.e., CPUs, GPUs, HDDs, or NVMe) that need to communicate when they cooperate to run the mentioned computing- and data-hungry applications. Indeed, new architectures that consider efficient resource and energy-consumption management are required for HPC systems and data centers, particularly for their interconnection networks. In this talk, recent trends and enhancements to the architecture of interconnection networks will be discussed, with particular focus on the communication requirements of large AI models, such as LLMs, on the dedicated scale-up and scale-out networks that enable the interconnection of thousands of accelerators, and on energy efficiency techniques. We will describe different network topologies, suitable routing algorithms, and efficient resource and energy-consumption management.
17:20-17:30 Workshop closing
Holger Froening (U. Heidelberg, Germany) - froening(at)uni-heidelberg.de
Teresa Cervero (BSC, Spain) - teresa.cervero(at)bsc.es
Dirk Pleiter (U. Groningen, Netherlands) - d.h.pleiter(at)rug.nl
Min Li (Huawei Research Europe) - minli2(at)huawei.com