Publications

Publishing your results is one of the most rewarding parts of research, because you are contributing to the process of science and making a difference.

Physical Oscillator Model for Supercomputing

Ayesha Afzal, Georg Hager, Gerhard Wellein


IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)

(Denver, CO, USA, 2023-11-12/2023-11-17, Available with Open Access)

Abstract

A parallel program together with the parallel hardware it is running on is not only a vehicle to solve numerical problems, it is also a complex system with interesting dynamical behavior: resynchronization and desynchronization of parallel processes, propagating phases of idleness, and the peculiar effects of noise and system topology are just a few examples. We propose a physical oscillator model (POM) to describe aspects of the dynamics of interacting parallel processes. Motivated by the well-known Kuramoto Model, a process with its regular compute-communicate cycles is modeled as an oscillator which is coupled to other oscillators (processes) via an interaction potential. Instead of a simple all-to-all connectivity, we employ a sparse topology matrix mapping the communication structure and thus the inter-process dependencies of the program onto the oscillator model and propose two interaction potentials that are suitable for different scenarios in parallel computing: resource-scalable and resource-bottlenecked applications. The former are not limited by a resource bottleneck such as memory bandwidth or network contention, while the latter are. Unlike the original Kuramoto model, which has a periodic sinusoidal potential that is attractive for small angles, our characteristic potentials are always attractive for large angles and only differ in the short-distance behavior. We show that the model with appropriate potentials can mimic the propagation of delays and the synchronizing and desynchronizing behavior of scalable and bottlenecked parallel programs, respectively.
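
The classical Kuramoto dynamics that motivate the model can be sketched in a few lines. The snippet below is an illustrative toy integration with a sparse ring topology matrix and the standard sinusoidal coupling, not the paper's actual interaction potentials; all parameter values are invented for illustration.

```python
import numpy as np

def kuramoto_step(theta, omega, A, K, dt):
    """One explicit-Euler step of Kuramoto-like dynamics.

    theta : phases of the N oscillators ("processes")
    omega : natural frequencies (per-process compute rates)
    A     : sparse topology matrix (A[i, j] = 1 if i communicates with j)
    K     : coupling strength per link
    """
    # Pairwise phase differences theta_j - theta_i, masked by the topology.
    diff = theta[None, :] - theta[:, None]
    coupling = K * np.sum(A * np.sin(diff), axis=1)
    return theta + dt * (omega + coupling)

# Toy setup: 16 oscillators on a ring (nearest-neighbor communication).
N = 16
rng = np.random.default_rng(0)
theta = rng.uniform(0, np.pi, N)              # initial phases (half circle)
omega = rng.normal(1.0, 0.05, N)              # slightly heterogeneous rates
A = np.zeros((N, N))
for i in range(N):
    A[i, (i - 1) % N] = A[i, (i + 1) % N] = 1  # ring topology

for _ in range(5000):
    theta = kuramoto_step(theta, omega, A, K=0.5, dt=0.01)

# The Kuramoto order parameter r; values close to 1 indicate synchronization.
r = np.abs(np.mean(np.exp(1j * theta)))
print("order parameter r =", round(r, 3))
```

Replacing np.sin with a different force law (the derivative of the interaction potential) is what distinguishes the resource-scalable and resource-bottlenecked scenarios described in the abstract.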

SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study

Ayesha Afzal, Georg Hager, Gerhard Wellein


IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)

(Denver, CO, USA, 2023-11-12/2023-11-17, Available with Open Access)

Abstract

In this work, fundamental performance, power, and energy characteristics of the full SPEChpc 2021 benchmark suite are assessed on two different clusters based on Intel Ice Lake and Sapphire Rapids CPUs using the MPI-only variants of the codes. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks on the node and cluster levels. Common patterns such as memory bandwidth limitation, dominating communication and synchronization overhead, MPI serialization, superlinear scaling, and alignment issues could be identified, in isolation or in combination, showing that SPEChpc 2021 is representative of many HPC workloads. Power dissipation and energy measurements indicate that the modern Intel server CPUs have such a high idle power level that race-to-idle is the paramount strategy for energy to solution and energy-delay product minimization. On the chip level, only memory-bound code shows a clear advantage of Sapphire Rapids compared to Ice Lake in terms of energy to solution.
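
The race-to-idle conclusion can be made explicit with a schematic energy accounting (a sketch for orientation, not a formula from the paper): write the package power as a large idle part plus a dynamic part.

```latex
% Schematic energy accounting (illustrative only, not taken from the paper):
% package power = idle (static) part + dynamic part
\begin{align*}
  E            &= \bigl(P_{\mathrm{idle}} + P_{\mathrm{dyn}}\bigr)\,T ,\\
  \mathrm{EDP} &= E \cdot T = \bigl(P_{\mathrm{idle}} + P_{\mathrm{dyn}}\bigr)\,T^{2} .
\end{align*}
% When P_idle dominates, E grows roughly linearly and EDP roughly
% quadratically with the runtime T, so minimizing T ("race to idle")
% also minimizes energy to solution and the energy-delay product.
```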

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives


Ayesha Afzal, Georg Hager, Stefano Markidis, Gerhard Wellein


Future Generation Computer Systems (FGCS)
(22 June 2023)

Abstract

Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful in spotting certain patterns in parallel execution that will easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications.
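
A phase-space style plot can be produced from lightweight per-rank timings. The sketch below uses synthetic data and one plausible choice of axes (accumulated compute time vs. accumulated MPI time per rank); this is an assumed construction for illustration, not the paper's actual tooling or data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: for each MPI rank and time step, the time spent
# computing and the time spent inside MPI (e.g., from rank-local timers).
nranks, nsteps = 8, 500
rng = np.random.default_rng(1)
t_comp = 1.0 + 0.05 * rng.standard_normal((nranks, nsteps))
t_mpi  = 0.2 + 0.05 * rng.standard_normal((nranks, nsteps))

fig, ax = plt.subplots()
for rank in range(nranks):
    # One trajectory per rank: cumulative compute time vs. cumulative MPI time.
    ax.plot(np.cumsum(t_comp[rank]), np.cumsum(t_mpi[rank]),
            lw=0.8, label=f"rank {rank}")
ax.set_xlabel("accumulated compute time [s]")
ax.set_ylabel("accumulated MPI time [s]")
ax.set_title("Phase-space view of per-rank execution (synthetic data)")
ax.legend(fontsize="small", ncol=2)
plt.show()
```

Bundles of trajectories that stay together indicate lock-step execution; fanning out or crossing trajectories hint at desynchronization patterns that a flat timeline view easily hides.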

The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs

Ayesha Afzal, Georg Hager, Gerhard Wellein


IEEE Transactions on Parallel and Distributed Systems (TPDS)

(10 November 2022)

Abstract

The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper we investigate the effect of "bottleneck evasion" and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code so that its processes become desynchronized. This occurs most prominently in memory-bound programs, which is why we choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structures, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime.
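
The MPI-augmented STREAM Triad mentioned above can be mimicked in a few lines: every rank streams through its own arrays and exchanges small messages with its ring neighbors each iteration. The sketch below (mpi4py, with invented array sizes and iteration counts) only illustrates the compute-communicate structure; it is not the benchmark code used for the paper's measurements.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

n = 8_000_000                          # per-rank array length (made up)
a = np.zeros(n); b = np.ones(n); c = np.ones(n)
s = 1.5
send = np.zeros(1024); recv = np.empty(1024)   # small "halo" messages

comm.Barrier()
t0 = MPI.Wtime()
for it in range(100):
    a[:] = b + s * c                   # STREAM Triad: memory-bound kernel
    # Ring-style exchange with both neighbors; Sendrecv avoids deadlocks.
    comm.Sendrecv(send, dest=right, recvbuf=recv, source=left)
    comm.Sendrecv(send, dest=left,  recvbuf=recv, source=right)
elapsed = MPI.Wtime() - t0

if rank == 0:
    triad_bytes = 3 * 8 * n * 100      # 3 streams x 8 bytes x n elements x 100 iters
    print(f"per-rank Triad bandwidth: {triad_bytes / elapsed / 1e9:.1f} GB/s")
```

With enough ranks per ccNUMA domain, the Triad part saturates the memory bandwidth; that saturation point is the precondition for the bottleneck evasion and desynchronization effects discussed above.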

Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Ayesha Afzal, Georg Hager, Gerhard Wellein, Stefano Markidis


PPAM ’22: the 14th International Conference on Parallel Processing and Applied Mathematics
(Gdansk, Poland, 2022-09-11/2022-09-14, Available with Open Access)


Abstract

This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.
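
The analysis pipeline can be prototyped with off-the-shelf tools. The sketch below assumes a hypothetical observable (MPI time per time step for each rank, stored as a 2-D array) and uses PCA and k-means as stand-ins for the techniques named in the abstract; the data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical observable: MPI time per time step for each rank,
# shape (nranks, nsteps) -- far smaller than a full MPI trace.
rng = np.random.default_rng(42)
nranks, nsteps = 64, 400
mpi_time = 0.2 + 0.02 * rng.standard_normal((nranks, nsteps))
mpi_time[32:] += np.linspace(0, 0.1, nsteps)   # synthetic "desynchronized" half

# Reduce each rank's trace to a few principal components ...
components = PCA(n_components=2).fit_transform(mpi_time)
# ... and group ranks with similar dynamics.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)

for cluster in np.unique(labels):
    print(f"cluster {cluster}: ranks {np.flatnonzero(labels == cluster)[:8]} ...")
```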

Addressing White-box Modeling and Simulation Challenges in Parallel Computing

Ayesha Afzal, Gerhard Wellein, Georg Hager


SIGSIM-PADS ’22: ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
(Atlanta, GA, USA, 2022-06-08/2022-06-10)


Abstract

“White-box” performance modeling of distributed-memory applications is notoriously inaccurate due to the wide spectrum of disturbances in the application and the system. Even for computational science applications that have extremely regular and homogeneous compute-communicate phases and a perfect translational symmetry across processes, simply adding communication time to computation time does often not yield an adequate estimate of parallel runtime. This is due to deviations from the expected “lock-step” execution; processes get out of sync and produce multi-faceted performance patterns. Prior related work has been conducted about the characterization of disturbances and their mitigation via explicit techniques. In contrast, an exhaustive theory of dynamics destroying the “lock-step” pattern is not available and is a great challenge for code optimization, simulation and performance modeling....

Analytic Performance Model for Parallel Overlapping Memory-Bound Kernels

Ayesha Afzal, Georg Hager, Gerhard Wellein


Concurrency and Computation: Practice and Experience
(January 2022, Available with Open Access)


Abstract

Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop cannot be upheld. Motivated by observations on plain and modified versions of the HPCG benchmark, we construct a performance model of execution of memory-bound loop kernels. It can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and which other workload the kernel is paired with. The only code features required are the single-thread memory request fraction per kernel, which is directly related to the single-thread memory bandwidth, and its saturated bandwidth. The former can either be measured directly or predicted using the Execution-Cache-Memory (ECM) performance model. The computational intensity of the kernels and the detailed structure of the code is of no significance. We validate our model on Intel Broadwell, Intel Cascade Lake, and AMD Rome processors pairing various streaming and stencil kernels. The error in predicting the bandwidth share per kernel is less than 8%.
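
To convey the flavor of such a model (and only that), here is a deliberately simplified toy version: assume each kernel's bandwidth demand grows linearly with its core count, scaled by its single-thread memory request fraction, and that a saturated contention domain splits its bandwidth in proportion to demand. This is an illustrative sketch, not the published model; all numbers are invented.

```python
def bandwidth_share(kernels, b_sat):
    """Toy bandwidth-sharing estimate on one memory contention domain.

    kernels : list of (single_thread_request_fraction, active_cores) pairs
    b_sat   : saturated (full-domain) memory bandwidth in GB/s
    Returns an estimated bandwidth share per kernel in GB/s.
    """
    # Each kernel's demand if it ran alone and scaled perfectly with cores.
    demand = [frac * cores * b_sat for frac, cores in kernels]
    total = sum(demand)
    if total <= b_sat:                    # domain not saturated: no contention
        return demand
    # Saturated: split the available bandwidth in proportion to demand.
    return [b_sat * d / total for d in demand]

# Example: a streaming kernel on 8 cores paired with a cache-friendlier
# stencil on 8 cores, on a domain with 100 GB/s saturated bandwidth.
print(bandwidth_share([(0.20, 8), (0.08, 8)], b_sat=100.0))
```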


[Poster] White-box modelling of parallel computing dynamics

Ayesha Afzal, Georg Hager, Gerhard Wellein


The 5th International Conference on High Performance Computing in Asia-Pacific Region
(Virtual, Online, 2022-01-12/2022-01-14)


Abstract

"White-box" performance modeling of distributed-memory applications is often inaccurate due to the wide spectrum of disturbances, which have a variety of performance impacts. Establishing a comprehensive theory of the underlying dynamics that destroy the "lock-step" pattern of bulk-synchronous parallel programs is a great challenge in HPC. As a first step, we have developed a validated analytic model of the propagation speed of "idle waves." Idle waves emerge when delays on individual MPI processes propagate across the others depending on the execution and communication properties of the program. We use a spectrum of HPC resources and application scenarios to further explore how these idle waves interact nonlinearly within a parallel code on a cluster and what are the key ingredients in their decay. A bottleneck among processes may break the inherent symmetry of the underlying software and hardware, leading to a structured, stable desynchronized state and an automatic communication and computation overlap. Although desynchronization can cause increased per-process waiting time, it can also boost the per-process memory bandwidth via bottleneck evasion, improving the overall time to solution. We have devised a performance model that can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and the bandwidth characteristics of paired kernels. Our innovative modeling approach addresses the analysis and simulation challenges of the dynamics of parallel computing and connects them to the physical world via a coupled oscillator model serving as a high-level cluster characterization tool.

Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact

Ayesha Afzal, Georg Hager, Gerhard Wellein


Chamberlain B.L., Varbanescu A.L., Ltaief H., Luszczek P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12728. Springer, Cham

(Virtual, Online, 2021-06-24/2021-07-02)

Abstract

Most distributed-memory bulk-synchronous parallel programs in HPC assume that compute resources are available continuously and homogeneously across the allocated set of compute nodes. However, long one-off delays on individual processes can cause global disturbances, so-called idle waves, by rippling through the system. This process is mainly governed by the communication topology of the underlying parallel code. This paper makes significant contributions to the understanding of idle wave dynamics. We study the propagation mechanisms of idle waves across the processes of MPI-parallel programs. We present a validated analytic model for their propagation velocity with respect to communication parameters and topology, with a special emphasis on sparse communication patterns. We study the interaction of idle waves with MPI collectives and show that, depending on the implementation, a collective may be permeable to the wave. Finally we analyze two mechanisms of idle wave decay: topological decay, which is rooted in differences in communication characteristics among parts of the system, and noise-induced decay, which is caused by system or application noise. We show that noise-induced decay is largely independent of noise characteristics but depends only on the overall noise power. An analytic expression for idle wave decay rate with respect to noise power is derived. For model validation we use microbenchmarks and stencil algorithms on three different supercomputing platforms.
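
As a rough orientation (a schematic simplification, not the validated model from the paper, which accounts for communication mode, topology, and noise): if an idle wave advances by at most the longest communication distance per compute-communicate cycle, its speed is bounded by

```latex
% Schematic upper bound on the silent-system propagation speed
% (orientation only; the paper derives and validates the actual model):
\[
  v_{\mathrm{idle}} \;\lesssim\; \frac{d_{\max}}{T_{\mathrm{comp}} + T_{\mathrm{comm}}}
  \qquad \left[\tfrac{\text{ranks}}{\text{s}}\right],
\]
% d_max: largest communication distance (in ranks) of the pattern,
% T_comp, T_comm: execution and communication time per cycle.
```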


[Poster] Physical Oscillator Model for Parallel Distributed Computing

Ayesha Afzal, Georg Hager, Gerhard Wellein


ISC High Performance Conference 2021

(Virtual, Online, 2021-06-24/2021-07-02)


Abstract

We propose a nonlinear physical model for describing the dynamics of processes in message-passing parallel programs. The starting point is the Kuramoto model, which exhibits self-synchronization across oscillators mediated by a long-range sinusoidal interaction potential. The phenomenology of distributed applications with regular compute-communicate cycles suggests a physical interpretation of processes as coupled oscillators. The bottleneck structure in terms of memory bandwidth, network injection bandwidth, and full-system bisection bandwidth is reflected in the interaction potential. A bottleneck-free program is best described by a Kuramoto-like potential, since the dynamics of the parallel code exhibits auto-synchronization. In the presence of bottlenecks, a short-range repulsive interaction must be superposed, yielding stability points away from the translationally symmetric, unstable bulk-synchronous mode. The modified Kuramoto model can characterize the dynamics of massively parallel code via a system of coupled ODEs. Although the optimal interaction potentials are still unknown, the model qualitatively describes a surprising number of effects such as resynchronization and desynchronization, propagation and decay of idle waves, bottleneck evasion via computational wavefronts, and the impact of fine-grained noise on program performance. As a major feature, it mimics the symmetry-breaking behavior of bottlenecked programs. Although large- and medium-scale phenomena are well described at least qualitatively, the microscopic behavior is certainly problematic. Current research includes the search for interaction potentials that provide a more quantitative characterization on the higher levels, the incorporation of fine-grained noise in the model, and the investigation of potential Goldstone modes emerging from the breaking of a global symmetry.
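
In schematic form, the coupled-ODE system implied by the abstract reads as follows; the concrete interaction potential V is exactly the open question mentioned above, so the expression is kept generic rather than being the model's final form.

```latex
% Generic modified-Kuramoto form (schematic; the concrete potential V is
% precisely the open question named in the abstract):
\[
  \frac{d\varphi_i}{dt} \;=\; \omega_i \;-\; \sum_{j=1}^{N} A_{ij}\,
  V'\!\left(\varphi_i - \varphi_j\right),
  \qquad i = 1,\dots,N,
\]
% \varphi_i: phase of process i, \omega_i: its natural compute rate,
% A_{ij}: sparse communication-topology matrix.
% The classical Kuramoto model corresponds to V(\Delta) = -K\cos\Delta;
% a bottleneck adds a short-range repulsive contribution to V.
```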


Desynchronization and wave pattern formation in MPI-parallel and hybrid memory-bound programs

Ayesha Afzal, Georg Hager, Gerhard Wellein


Sadayappan P., Chamberlain B., Juckeland G., Ltaief H. (eds) High Performance Computing. ISC High Performance 2020. Lecture Notes in Computer Science, vol 12151. Springer, Cham

(Virtual, Online, 2020-06-22/2020-06-25)

Abstract

Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time does often not yield a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. In case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI-OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting.


ClusterCockpit—A web application for job-specific performance monitoring

Jan Eitzinger, Thomas Gruber, Ayesha Afzal, Thomas Zeiser, Gerhard Wellein

2019 IEEE International Conference on Cluster Computing (CLUSTER)

(Albuquerque, NM, USA, 2019-09-23/2019-09-26)


Abstract

Monitoring is a common component of HPC system software. Up to now, monitoring focused mainly on health checking and system level performance as well as on job scheduler information and was targeted towards system administrators. Recently, job-specific performance monitoring based on hardware performance counter metrics has gained attention at academic HPC computing centers. HPC is becoming a mainstream tool that is also used by non-HPC experts, and HPC centers see a demand to check for pathological jobs and jobs with large optimization potential. The possibility to measure hardware performance counter data with negligible overhead allows assessment of efficient resource utilization and detection of pathological jobs. Pathological jobs are, e.g., jobs with errors in the batch script, jobs which do not terminate, jobs with severe load imbalance, or jobs that do not use any resources. This paper introduces ClusterCockpit, a tailor-made web front-end for job-specific performance monitoring. While many recent job-specific performance monitoring efforts concentrate on the measurement and data collection layers, ClusterCockpit provides a modern user interface targeted towards performance analysts as well as application users.


[Poster] Delay Flow Mechanisms on Clusters

Ayesha Afzal, Georg Hager, Gerhard Wellein


EuroMPI '19 Proceedings of the 26th European MPI Users' Group Meeting

(Zurich, Switzerland, 2019-09-10/2019-09-13)


Abstract

Analytic runtime models of distributed-memory applications are often inaccurate because of the wide range of effects that can disturb the regular compute-communicate cycle. Possible sources of disturbance are long-duration delays, fine-grained on-node system noise, variations in network performance, network contention, and application load imbalance. There is to date no comprehensive theory about how delays from different sources travel through a parallel application running on a cluster, or how collective phenomena break the inherent symmetry of the underlying software and hardware. This is partly because of the huge parameter space involved. In this poster, we use synthetic microbenchmarks to highlight three effects that are of importance in this context: propagation of long-term delays, noise-assisted decay of propagating delays, and noise-induced desynchronization of memory-bound applications. Especially the latter leads to surprising insights about performance features of hybrid (MPI+OpenMP) parallel applications.


Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study

Ayesha Afzal, Georg Hager, Gerhard Wellein

2019 IEEE International Conference on Cluster Computing (CLUSTER)

(Albuquerque, NM, USA, 2019-09-23/2019-09-26)


Abstract

Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called “noise”) run contrary to the assumptions about regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long, one-off delays of execution periods that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pin-point the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of “idle waves,” i.e., propagating phases of inactivity, emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications.
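
A minimal synthetic setup of the kind described above injects one long delay on a single rank of a regular compute-communicate loop and records per-rank, per-iteration timestamps; plotted over rank and iteration, the injected delay appears as a kink traveling away from the disturbed rank. The sketch below (mpi4py, with made-up parameters) illustrates the idea and is not the benchmark used in the paper.

```python
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

niter, delay_iter, delay_rank = 200, 50, size // 2
work = np.ones(2_000_000)
buf_s, buf_r = np.zeros(8), np.empty(8)
stamps = np.zeros(niter)

for it in range(niter):
    work *= 1.0000001                       # fixed per-iteration workload
    if it == delay_iter and rank == delay_rank:
        time.sleep(0.5)                     # one-off injected delay
    # Next-neighbor exchange couples the ranks and carries the idle wave.
    comm.Sendrecv(buf_s, dest=right, recvbuf=buf_r, source=left)
    comm.Sendrecv(buf_s, dest=left,  recvbuf=buf_r, source=right)
    stamps[it] = MPI.Wtime()

all_stamps = comm.gather(stamps, root=0)
if rank == 0:
    # In a rank-vs-iteration plot of these timestamps, the injected delay
    # shows up as a kink that propagates outward from delay_rank.
    np.save("idle_wave_timestamps.npy", np.array(all_stamps))
```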


OpenCL-based FPGA design to accelerate the nodal discontinuous Galerkin method for unstructured meshes

Tobias Kenter, Gopinath Mahale, Samer Alhaddad, Yevgen Grynko, Christian Schmitt, Ayesha Afzal, Frank Hannig, Jens Förstner, Christian Plessl

2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

(Boulder, CO, USA, 2018-04-29/2018-05-01)


Abstract

The exploration of FPGAs as accelerators for scientific simulations has so far mostly been focused on small kernels of methods working on regular data structures, for example in the form of stencil computations for finite difference methods. In computational sciences, often more advanced methods are employed that promise better stability, convergence, locality and scaling. Unstructured meshes are shown to be more effective and more accurate, compared to regular grids, in representing computation domains of various shapes. Using unstructured meshes, the discontinuous Galerkin method preserves the ability to perform explicit local update operations for simulations in the time domain. In this work, we investigate FPGAs as target platform for an implementation of the nodal discontinuous Galerkin method to find time-domain solutions of Maxwell's equations in an unstructured mesh. When maximizing data reuse and fitting constant coefficients into suitably partitioned on-chip memory, high computational intensity allows us to implement and feed wide data paths with hundreds of floating point operators. By decoupling off-chip memory accesses from the computations, high memory bandwidth can be sustained, even for the irregular access pattern required by parts of the application. Using the Intel/Altera OpenCL SDK for FPGAs, we present different implementation variants for different polynomial orders of the method. In different phases of the algorithm, either computational or bandwidth limits of the Arria 10 platform are almost reached, thus outperforming a highly multithreaded CPU implementation by around 2x.

Solving Maxwell's Equations with Modern C++ and SYCL: A Case Study

Ayesha Afzal, Christian Schmitt, Samer Alhaddad, Yevgen Grynko, Jürgen Teich, Jens Förstner, Frank Hannig

2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
(Politecnico di Milano, Milan, Italy, 2018-07-10/2018-07-12)

Abstract

In scientific computing, unstructured meshes are a crucial foundation for the simulation of real-world physical phenomena. Compared to regular grids, they allow resembling the computational domain with a much higher accuracy, which in turn leads to more efficient computations. There exists a wealth of supporting libraries and frameworks that aid programmers with the implementation of applications working on such grids, each built on top of existing parallelization technologies. However, many approaches require the programmer to introduce a different programming paradigm into their application or provide different variants of the code. SYCL is a new programming standard providing a remedy to this dilemma by building on standard C++17 with its so-called single-source approach: Programmers write standard C++ code and expose parallelism using C++17 keywords. The application is then transformed into a concrete implementation by the SYCL implementation. By encapsulating the OpenCL ecosystem, different SYCL implementations enable not only the programming of CPUs but also of heterogeneous platforms such as GPUs or other devices. For the first time, this paper showcases a SYCL-based solver for the nodal Discontinuous Galerkin method for Maxwell's equations on unstructured meshes. We compare our solution to a previous C-based implementation with respect to programmability and performance on heterogeneous platforms.


[Thesis] The Cost of Computation: Metrics and Models for Modern Multicore-based Systems in Scientific Computing

Ayesha Afzal

Master's thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg

(Erlangen-Nürnberg, Germany, 2015-07)


Abstract

The increasing concern about power consumption in many computing systems points to the need for power/energy modelling and estimation of high-end computing systems. The goal of the present work is to propose power/energy models that can predict the run-time energy consumption of loop kernels and programs by specifying their properties with respect to scaling behaviour, data transfer through the memory hierarchy, and low-level operations. This model should ultimately be able to answer the following questions: “How can algorithm properties help to inform power management?” “How can a program be run so that the overall energy consumption is minimized without compromising time to solution?” “How can a program be executed so that the overall energy consumption is minimized, given a maximum allowed increase in time to solution?” “How should a slowdown of the problem be treated in the case of power capping? What is the potential of larger machines for energy saving by compensating for slow code?” This work proposes component-level (CPUs and DRAMs) power and/or energy models and synthesizes a significant number of techniques to obtain an energy-efficient system for a wide range of situations by focusing on common architectural characteristics. To validate the models and results, experimental measurements based on selected benchmarks are compared against the models’ predictions on multicore-based systems. The present work describes the characteristics of different multicore processors available at the RRZE computing centre, elaborating on every microarchitecture change with a die shrink of the process technology. A comparison of these systems is performed from the perspectives of performance, power dissipation, and, most relevantly, energy consumption. A statistical analysis of power characteristics is carried out to quantify variations of the power dissipation across the Ivy Bridge EP processors in the “emmy” production system at RRZE. This enables us to define a policy for power-aware scheduling on the “emmy” cluster and thereby save energy costs. This work also elaborates on how different types of processor instructions and tunable-intensity benchmarks affect the power dissipation and its model parameters W_i on recent processors. In addition, it shows that low-level microscopic parameters (i.e., the energy cost of one flop or of one byte transferred between memory hierarchy levels) predict the energy cost at the macroscopic level. Finally, these two modelling techniques are combined to determine the macroscopic W_i parameters from the low-level microscopic parameters.
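
The microscopic-to-macroscopic step mentioned at the end of the abstract can be written schematically (illustrative notation, not the thesis' exact formulation): kernel energy is composed of a baseline term proportional to runtime plus per-flop and per-byte costs summed over the memory hierarchy levels.

```latex
% Schematic composition of kernel energy from microscopic cost parameters
% (illustrative notation, not the thesis' exact formulation):
\[
  E \;\approx\; P_0\,T
  \;+\; n_{\mathrm{flop}}\, e_{\mathrm{flop}}
  \;+\; \sum_{\ell} n_{\mathrm{byte}}^{(\ell)}\, e_{\mathrm{byte}}^{(\ell)},
\]
% P_0: baseline power, T: runtime, e_flop: energy per floating-point
% operation, e_byte^(l): energy per byte moved across memory hierarchy
% level l, n_*: corresponding operation and traffic counts.
```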


White-box modeling for performance and energy: useful patterns for resource optimization

Georg Hager, Ayesha Afzal

Invited talk, Workshop on Power-Aware COmputing (PACO)
(Max Planck Institute, Magdeburg, Germany, 2015-07-06/2015-07-07)


Abstract

A realistic upper limit for the performance of a code on a particular computer hardware may be called its light speed. Light speed allows a well-defined answer to the question whether an implementation of an algorithm is “good enough.” A model leading to an accurate light speed estimate requires thorough code analysis, knowledge of computer architecture, and experience on how software interacts with hardware. The notion of light speed depends very much on the machine model underlying the hardware model; if the machine model misses an important performance-limiting detail, one might arrive at the (false) conclusion that light speed is not attained by the code at hand, while it actually is. Which hardware features should be included to arrive at a good balance between simplicity and predictive power is a crucial question, and this talk tries to give useful answers to it. Two pivotal concepts are the cornerstones of the modeling process: bottlenecks and performance patterns. A bottleneck is a hardware feature that limits the performance of a program. A performance pattern is a performance-limiting motif in which one or more bottlenecks (or a complete lack thereof) may be present. Identifying a performance pattern via observable behavior is the first step towards building a useful performance model. In complex cases it may not be possible to establish a model at all. If a model can be built, one can gain a deeper understanding of the interactions between software and hardware. If the model works, i.e., if its predictions can be validated by measurements, this is an indication that it describes certain aspects of this interaction accurately. If the model does not work, it must be refined, leading to more insights. A working model can help with predicting the possible gain of code optimizations. Changing the program code may require adjustments in the model, or even building a completely new model when the underlying algorithm was changed. When quantitative insight into the performance aspects of an implementation has been gained, one can proceed to include energy aspects in the modeling process. To lowest order, the energy used for performing some computation is proportional to the wall-clock time required. Starting from this assertion, together with some simplifying assumptions about scalability behavior and the dependence of power dissipation on clock speed and the number of cores used, one can construct a simple chip-level power/performance model that yields surprisingly deep insights into the energy aspects of computation. The talk presents examples that reveal the interplay between clock speed dependence and scaling behavior, and gives hints as to how one may exploit the full potential for saving energy with minimal concessions regarding performance.
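
A chip-level power/performance model of the kind sketched in the talk can, under the stated simplifying assumptions, take the following schematic form; the W coefficients and the quadratic frequency dependence are illustrative assumptions rather than measured values.

```latex
% Schematic chip-level power/performance model under the simplifying
% assumptions named in the abstract (coefficients and the quadratic
% frequency term are illustrative, not measured):
\begin{align*}
  P(n, f) &\approx W_0 + n\,\bigl(W_1 f + W_2 f^{2}\bigr)
    && \text{(power with $n$ active cores at clock speed $f$)},\\
  E(n, f) &= P(n, f)\cdot T(n, f)
    && \text{(energy to solution)},
\end{align*}
% where T(n, f) encodes the code's scaling behavior, e.g. runtime falling
% with n until a bandwidth saturation point and staying constant beyond it.
% A large baseline W_0 makes runtime the dominant lever for energy, which
% is why clock speed and core count can often be tuned to save energy with
% only minimal concessions in performance.
```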