Stanislav Bratanov - Intel / Software Tools
VTune as a bridge between performance, parallelism, and power analysis: the use of a control-flow tree model (call stacks and loops, to recover a wider context when there is no single 90% hotspot to optimize), the synchronization profile (detection of wait spots and of the cost of synchronization), and, building on both, power analysis (locating power-inefficient synchronization – wait rate combined with C-state residencies – and expressing the cost of everything in Joules).
http://software.intel.com/en-us/intel-vtune-amplifier-xe
Stephane Eranian - Google, Linux
The perf_events subsystem is now the standard Linux interface for performance monitoring of applications and whole systems. It provides full access to the hardware performance counters, kernel software events, and tracing. In this tutorial, we give a quick overview of the interface. Next, we describe the current tool landscape. We start with the reference tool shipped by all distributions, called perf, which enables counting, profiling, and tracing. For more sophisticated cycle-breakdown analysis, we then demonstrate Google's Gooda tool, which takes profiles collected by perf and generates function-, source-, and assembly-level cycle breakdowns, all displayed via a web-based interface. We end the presentation with a Q&A session.
Stephane Eranian is a senior software engineer at Google, where he works on the Linux kernel team. For almost a decade he has been developing performance monitoring interfaces and tools, and he is a major contributor to the perf_events and perf tool projects. Prior to joining Google, he worked at HP Labs on the port of Linux to Itanium. He is the co-author of the book “IA-64 Linux Kernel: Design and Implementation”.
http://code.google.com/p/gooda/
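As a rough illustration of the raw perf_events interface mentioned in the tutorial abstract (a minimal sketch, not part of the tutorial material), the following C program counts retired user-space instructions for a section of code via the perf_event_open system call; the whole-program equivalent with the reference tool would be a command along the lines of perf stat -e instructions ./app.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* perf_event_open has no glibc wrapper, so it is invoked via syscall(2). */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;           /* generic hardware event */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* retired instructions */
    attr.disabled = 1;                        /* start disabled, enable explicitly */
    attr.exclude_kernel = 1;                  /* count user space only */

    fd = perf_event_open(&attr, 0, -1, -1, 0); /* calling thread, any CPU */
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code section being measured goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("instructions retired: %lld\n", count);

    close(fd);
    return 0;
}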
Andrzej Nowak - CERN openlab and EPFL
This talk will cover various observations from profiling activities on large scientific applications – namely the software that supports the Large Hadron Collider at CERN and runs on the LHC Computing Grid. The characteristics and footprints of these applications will be briefly presented, as well as some observed effects on the microarchitecture – all taking into account the specifics of the international collaborations that write and maintain the codebase. The relevance of the legacy performance monitoring methods employed over the past 5-7 years will be reviewed in the context of modern optimization challenges on complex C++ code. In particular, the usefulness of the tools, workflows, and methodologies currently in use will be discussed, as well as the obstacles to obtaining actionable information.
Ahmad Yasin - Intel / Processor Architecture
Optimizing an application’s performance for a given micro-architecture (uarch) has become painfully difficult. Challenges include increasing CPU uarch complexity, workload diversity, and the unmanageable volume of data produced by performance tools. The Top-Down analysis methodology applies performance counters in a structured, hierarchical way in order to quickly and, more importantly, correctly identify the dominant performance bottlenecks. Moreover, the Top-Down use case has enabled practical, usage-driven refinement of hardware performance instrumentation in recent CPUs, replacing what has traditionally been a bottom-up, ad hoc counter definition process.
Whitepaper: How to Tune Applications Using a Top-down Characterization of Micro-architectural Issues. Foils: on using-intel-vtune-amplifier-xe-on-xeon-e5-family-1.0.pdf
Intel Optimization Manual – documentation under Appendix B.
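To make the hierarchical idea concrete, here is a minimal sketch (not from the talk) of the level-1 breakdown as commonly published for 4-wide Intel cores; the event names in the comments follow Intel's documentation, and the whitepaper above remains the authoritative reference.

#include <stdio.h>

struct topdown_counters {
    double clocks;           /* CPU_CLK_UNHALTED.THREAD */
    double slots_issued;     /* UOPS_ISSUED.ANY */
    double slots_retired;    /* UOPS_RETIRED.RETIRE_SLOTS */
    double fetch_bubbles;    /* IDQ_UOPS_NOT_DELIVERED.CORE */
    double recovery_bubbles; /* 4 * INT_MISC.RECOVERY_CYCLES */
};

static void topdown_level1(const struct topdown_counters *c)
{
    double slots = 4.0 * c->clocks;   /* total issue slots on a 4-wide machine */

    double frontend_bound  = c->fetch_bubbles / slots;
    double bad_speculation = (c->slots_issued - c->slots_retired
                              + c->recovery_bubbles) / slots;
    double retiring        = c->slots_retired / slots;
    double backend_bound   = 1.0 - frontend_bound - bad_speculation - retiring;

    printf("Frontend Bound : %5.1f%%\n", 100.0 * frontend_bound);
    printf("Bad Speculation: %5.1f%%\n", 100.0 * bad_speculation);
    printf("Retiring       : %5.1f%%\n", 100.0 * retiring);
    printf("Backend Bound  : %5.1f%%\n", 100.0 * backend_bound);
}

int main(void)
{
    /* Hypothetical counter values, purely for illustration. */
    struct topdown_counters c = {
        .clocks = 1e9, .slots_issued = 2.6e9, .slots_retired = 2.4e9,
        .fetch_bubbles = 0.4e9, .recovery_bubbles = 0.2e9,
    };
    topdown_level1(&c);
    return 0;
}

Whichever level-1 category dominates is then drilled into at the next level of the hierarchy rather than chasing individual counters bottom-up.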
Michael Chynoweth, Rajshree Chabukswar - Intel / Software & Services Group
The presentation will cover the methodologies of the PBA (Performance Bottleneck Analyzer) toolset, which has been maintained by Intel engineers for 7+ years and applied to future architectures. Using examples from software vendors, we will show how PBA was able to find and fix issues on Intel’s future SoCs that could not be identified with any other methodology or toolset. PBA recreates very long flows of execution on the processor and then combines knowledge of processor events with static assembly analysis to find and prioritize bottlenecks on Intel’s latest architectures. Filters are then applied to the data set to highlight issues that impact user experience, power, or slow transactions, ensuring that the developer concentrates on the right problems. We will also showcase power-analysis functionality that uses data from other Intel tools, such as Intel® Power Gadget (power and frequency bucketing), on field examples. The talk will focus on how the collaborative framework has been used to share methodologies across multiple software vendor accounts and disciplines.
http://software.intel.com/en-us/articles/intel-performance-bottleneck-analyzer/
Cfir Aguston, Yosi Ben-Asher, Gadi Haber - University of Haifa
Tools that provide optimization hints to program developers face severe obstacles and are often unable to provide meaningful guidance on how to parallelize real-life applications. The main reason is the high complexity and large size of commercially valuable code. Such code is often rich with pointers, heavily nested conditional statements, nested while-based loops filled with complicated control-flow statements, function calls, etc. These constructs prevent existing compiler analyses from extracting the full parallelization potential. We propose a new paradigm to overcome this issue by automatically transforming the code into a much simpler, skeleton-like form that is more conducive to auto-parallelization. We then apply existing tools for source-level automatic parallelization to the skeletonized code in order to check for parallelization potential. The skeleton code, along with the parallelized version, is then provided to the programmer in the form of an IDE (Integrated Development Environment) recommendation.
Optimizing Code Advisor for Java source code on top of Eclipse JDT version 3.4
Optimizing Code Advisor 4.0 for C/C++ source code on top of Eclipse CDT Helios
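The following hypothetical fragment (invented for this summary, not taken from the tool) illustrates the kind of simplification described above: a pointer-chasing loop with nested control flow is flattened into a skeleton-like counted loop that a source-level auto-parallelizer can analyze for independent iterations.

#define NODE_ACTIVE 0x1

struct node {
    struct node *next;
    unsigned flags;
    int len;
    double *data;
};

/* Original form: pointer-based traversal with nested control flow.
 * Auto-parallelizers typically cannot prove these iterations independent. */
void scale_list(struct node *head, double factor)
{
    for (struct node *p = head; p != NULL; p = p->next) {
        if (p->flags & NODE_ACTIVE) {
            int k = 0;
            while (k < p->len) {
                p->data[k] *= factor;
                k++;
            }
        }
    }
}

/* Skeleton-like form: the list is first flattened into an array of node
 * pointers, so the hot work becomes a simple counted loop whose iterations
 * a source-level parallelizer can examine and, if safe, parallelize. */
void scale_list_skeleton(struct node **nodes, int n, double factor)
{
    for (int i = 0; i < n; i++) {
        struct node *p = nodes[i];
        if (!(p->flags & NODE_ACTIVE))
            continue;
        for (int k = 0; k < p->len; k++)
            p->data[k] *= factor;
    }
}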
Naftaly Shalev - Intel / Software Tools
CPU architecture has evolved in recent years toward multi-threaded and multi-core processors. With an ever-growing number of cores and threads in a processor, great effort is invested in making sure that the HW scales well. However, scaled HW does not necessarily mean scaled performance. The missing link is, as always, the SW. Most programs today are still largely serial, single-threaded programs and therefore cannot take advantage of the parallel power that modern processors have to offer. Furthermore, writing an effective parallel program can prove quite difficult and holds many pitfalls even for the most qualified programmers. Advisor is a tool that guides the programmer, through a strict methodology, from serial thinking and serial programs to effective parallel programs. It assists the programmer in identifying parallelization potential and helps prevent common pitfalls. Advisor saves time because it predicts the performance gain of parallelizing your code and warns about possible problems such as data races and inefficient locking schemes. In short, Advisor is your guide to effective parallel programming!
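As a rough sketch of that guided workflow (assuming the talk refers to Intel Advisor XE and its C/C++ annotation macros from advisor-annotate.h; the loop and names below are invented), the candidate region is marked as a parallel site with per-iteration tasks, after which Advisor's Suitability analysis can predict the speedup and its Correctness analysis can flag data races before any real threading is added.

#include <stddef.h>
#include "advisor-annotate.h"  /* shipped with Intel Advisor XE (assumed API) */

void smooth(double *out, const double *in, size_t n)
{
    ANNOTATE_SITE_BEGIN(smooth_site);          /* candidate parallel region */
    for (size_t i = 1; i + 1 < n; i++) {
        ANNOTATE_ITERATION_TASK(smooth_task);  /* model each iteration as one task */
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
    ANNOTATE_SITE_END();
}

Once the modeled speedup and correctness results look acceptable, the annotations are replaced by real threading constructs (e.g., OpenMP or Intel TBB) and the program is re-checked.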