Publications

Copyright disclaimer

The documents contained in this page are included to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are retained by the authors or other copyright holders. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

[2024]

Bhargav Reddy Godala, Sankara Prasad Ramesh, Gilles Pokam, Jared Stark, Andre Seznec, Dean Tullsen and David I. August.

In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS'24), La Jolla, CA, USA, April 2024.

Abstract

Modern server workloads have large code footprints that are prone to front-end bottlenecks due to instruction cache capacity misses. Even with aggressive fetch-directed instruction prefetching (FDIP), as implemented in modern processors, there are still significant front-end stalls due to I-cache misses. A major portion of misses that occur on a BPU-predicted path are tolerated by FDIP without causing stalls. Prior work on instruction prefetching, however, was not designed to work with FDIP processors: its singular goal is reducing I-cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (those not fully hidden by FDIP), and targeting only them.

In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches only for targets where FDIP struggles – along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates them with a trigger for future prefetch. At a 43.5KB budget, PDIP achieves up to 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
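
The sketch below is only a toy illustration of the general idea described in this abstract: associating an I-cache line whose miss caused a front-end stall with an earlier trigger address so it can be prefetched the next time the trigger is fetched. The class name, table sizes, and training interface are hypothetical and are not PDIP's actual structures.

```cpp
// Illustrative sketch only: a toy model of associating front-end
// stall-causing I-cache miss targets with an earlier "trigger" fetch
// address. Table sizes, names, and the replacement scheme are
// hypothetical, not PDIP's.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

class TriggerTargetTable {
public:
    // Called when a miss on the resteered path caused a front-end stall:
    // associate the missing line with a trigger observed earlier.
    void train(uint64_t trigger_pc, uint64_t missed_line) {
        auto& targets = table_[trigger_pc];
        if (targets.size() < kMaxTargets) targets.push_back(missed_line);
    }
    // Called on every fetch: if the PC is a known trigger, prefetch its targets.
    std::vector<uint64_t> lookup(uint64_t fetch_pc) const {
        auto it = table_.find(fetch_pc);
        return it == table_.end() ? std::vector<uint64_t>{} : it->second;
    }
private:
    static constexpr size_t kMaxTargets = 2;   // hypothetical associativity
    std::unordered_map<uint64_t, std::vector<uint64_t>> table_;
};

int main() {
    TriggerTargetTable t;
    t.train(/*trigger_pc=*/0x401000, /*missed_line=*/0x7f3a40);
    for (uint64_t line : t.lookup(0x401000))
        std::cout << "prefetch I-cache line 0x" << std::hex << line << "\n";
}
```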

Yuxuan Zhang,  Nathan Sobotka, Soyoon Park, Saba Jamilan, Tanvir Khan,  Baris Kasikci, Gilles Pokam, Heiner Litz and Joseph Devietti.

In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS'24), La Jolla, CA, USA, April 2024.

Abstract

Data cache prefetching is a well-established optimization to overcome the limits of the cache hierarchy and keep the processor pipeline fed with data. In principle, accurate, well-timed prefetches can sidestep the majority of cache misses and dramatically improve performance. In practice, however, it is challenging to identify which data to prefetch and when to do so. In particular, data can be easily requested too early, causing eviction of useful data from the cache, or requested too late, failing to avoid cache misses. Competition for limited off-chip memory bandwidth must also be balanced between prefetches and a program’s regular “demand” accesses. Due to these challenges, prefetching can both help and hurt performance, and the outcome can depend on program structure, decisions about what to prefetch and when to do it, and, as we demonstrate in a series of experiments, program input, processor microarchitecture, and their interaction as well. 

To try to meet these challenges, we have designed the RPG^2 system for online prefetch injection and tuning. RPG^2 is a pure-software system that operates on running C/C++ programs, profiling them, injecting prefetch instructions, and then tuning those prefetches to maximize performance. Across dozens of inputs, we find that RPG^2 can provide speedups of up to 2.15×, comparable to the best profile-guided prefetching compilers, but can also respond when prefetching ends up being harmful and roll back to the original code – something that static compilers cannot. RPG^2 improves prefetching robustness by preserving its performance benefits while avoiding slowdowns.
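
For readers unfamiliar with software prefetching, the fragment below shows the general shape of an injected prefetch with a tunable lookahead distance. It uses the GCC/Clang __builtin_prefetch builtin and a hypothetical PREFETCH_DISTANCE constant; it is only a source-level illustration of the kind of prefetch RPG^2 tunes, not RPG^2's binary-level injection mechanism.

```cpp
// Illustrative only: the shape of a software prefetch for an irregular
// (pointer-indexed) access, with a tunable lookahead distance. RPG^2
// injects and tunes prefetches like this in a *running* binary; here the
// prefetch is written in source and the distance is a compile-time guess
// rather than an online-tuned parameter.
#include <cstddef>
#include <vector>

// PREFETCH_DISTANCE is a hypothetical knob: too small and the data is not
// ready in time, too large and it may be evicted before use.
constexpr std::size_t PREFETCH_DISTANCE = 16;

long sum_indirect(const std::vector<std::size_t>& idx,
                  const std::vector<long>& data) {
    long sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + PREFETCH_DISTANCE < idx.size()) {
            // Hint the hardware to bring a future element into cache
            // (read access, moderate temporal locality).
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 2);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```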

[2023]

Yuxuan Zhang, Tanvir Khan,  Gilles Pokam, Baris Kasikci, Heiner Litz and Joseph Devietti.

IEEE Micro Special Issue: Top Picks from Computer Architecture Conferences (Micro's Top Picks), July/Aug 2023.

Abstract

The processor front end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. In this article, we propose Online COde Layout OptimizationS (OCOLOS), the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process instead of being performed offline and requiring the application to be relaunched. Our experiments show that OCOLOS can accelerate MySQL by up to 41%. 

Nayana Prasad Nagendra, Bhargav Reddy Godala, Ishita Chaturvedi, Atmn Patel, Svilen Kanev, Tipp Moseley, Jared Stark, Gilles Pokam, Simone Campanoni, and David I. August.

In Proceedings of the 50th International Symposium on Computer Architecture (ISCA 2023), June 2023.

Abstract

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. 
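
As a rough illustration of cost-aware replacement (not EMISSARY's actual policy), the sketch below marks lines whose misses caused decode starvation and resists choosing them as eviction victims; the structure names and the fallback rule are simplified assumptions.

```cpp
// Toy illustration of cost-aware instruction-cache replacement: lines whose
// misses previously caused a decode starvation get a "costly" bit and are
// resisted as eviction victims. The layout and fallback rule here are
// simplified stand-ins, not the exact EMISSARY policy.
#include <array>
#include <cstdint>

struct Line { uint64_t tag = 0; bool valid = false; bool costly = false; };

template <size_t Ways>
class CostAwareSet {
public:
    // Prefer an invalid way, then a non-costly way; only if every resident
    // line is marked costly do we fall back to evicting way 0.
    size_t pick_victim() const {
        for (size_t w = 0; w < Ways; ++w) if (!lines_[w].valid) return w;
        for (size_t w = 0; w < Ways; ++w) if (!lines_[w].costly) return w;
        return 0;
    }
    void fill(uint64_t tag, bool miss_caused_starvation) {
        size_t v = pick_victim();
        lines_[v] = {tag, true, miss_caused_starvation};
    }
private:
    std::array<Line, Ways> lines_;
};

int main() {
    CostAwareSet<4> set;
    set.fill(0x1234, /*miss_caused_starvation=*/true);   // resists eviction
    set.fill(0x5678, /*miss_caused_starvation=*/false);
    return static_cast<int>(set.pick_victim());
}
```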

Satish Narayanasamy, Gilles Pokam and Brad Calder

In ISCA@50 Retrospective: 1996-2020. Edited by Jose F. Martinez and Lizy K. John, June 2023 

Selected for inclusion in ISCA@50 25-year Retrospective 1996-2020 

Abstract

BugNet was published in ISCA 2005. When we started this line of research in the early to mid-2000s, the processor industry was at an inflection point. Performance was no longer the only consideration in processor design. Dennard scaling was breaking down, ushering in an era of multi-cores. The increasing complexity of software (e.g., Windows XP had 40-50 million lines of code) was already significantly straining programmer productivity. With the advent of multi-core processors, there arose an even more daunting challenge for most programmers: learning parallel programming, a task that was typically left to specialists. Consequently, there was a need within the computer science community to improve debugging as we transitioned to multi-core processors, given the programming challenges they presented.

In this context, we asked ourselves: what can processors do to help improve programmer productivity? Processors had performance counters to analyze performance issues, but lacked sufficient support for program correctness reasoning and debugging, except for a limited number of watchpoint and breakpoint registers. 

Given that a significant fraction of a developer’s time is spent on debugging, we then asked: what if we could reserve a small memory space, and log some useful information during a program’s execution that could later help programmers debug in the event of a program failure? These questions led to further investigation that resulted in BugNet.

[2022]

Yuxuan Zhang, Tanvir Khan,  Gilles Pokam, Baris Kasikci, Heiner Litz and Joseph Devietti.

In Proceedings of the 55th International Symposium on Microarchitecture (MICRO-55), Oct. 2022.

Abstract

The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application's code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities.

In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be relaunched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41×, the Verilator hardware simulator by up to 2.20×, and a build of the Clang compiler by up to 1.14×.

Samira Mirbagher, Daniel Moghimi,  Jeffrey Collins,  Gilles Pokam, Nael Abu-Ghazaleh, and Dean Tullsen.

In Proceedings of the 55th International Symposium on Microarchitecture (MICRO-55), Oct. 2022.

Abstract

This paper provides an end-to-end solution to defend against known microarchitectural attacks such as speculative execution attacks, fault-injection attacks, covert and side channel attacks, and unknown or evasive versions of these attacks. Current defenses are attack specific and can have unacceptably high performance overhead. We propose an approach that reduces the overhead of state-of-the-art defenses by over 95%, by applying defenses only when attacks are detected. Many currently proposed mitigations are not practical for deployment; for example, InvisiSpec has 27% overhead and Fencing has 74% overhead while protecting against only Spectre attacks. Other mitigations carry similar performance penalties. We reduce the overhead for InvisiSpec to 1.26% and for Fencing to 3.45%, offering performance and security not only for Spectre attacks but for other known transient attacks as well, including the dangerous class of LVI and Rowhammer attacks, and covering a large set of future evasive and zero-day attacks.

Critical to our approach is an accurate detector that is not fooled by evasive attacks and that can generalize to novel zero-day attacks. We use a novel generative framework, Evasion Vaccination (EVAX), for training ML models and engineering new security-centric performance counters. EVAX significantly increases sensitivity to detect and classify attacks in time for mitigation to be deployed, with low false positives (4 FPs in every 1M instructions in our experiments). Such performance enables efficient and timely mitigations, allowing the processor to automatically switch between performance and security as needed.

[2021]

Tanvir Khan, Nathan Brown, Akshitha Sriraman, Niranjan Soundararajan, Rakesh Kumar,  Joseph Devietti, Sreenivas Subramoney, Gilles Pokam, Heiner Litz, and Baris Kasikci.

In Proceedings of the 54th International Symposium on Microarchitecture (MICRO-54), Oct. 2021.

Abstract

Modern data center applications have deep software stacks, with instruction footprints that are orders of magnitude larger than typical instruction cache (I-cache) sizes. To efficiently prefetch instructions into the I-cache despite large application footprints, modern server-class processors implement a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP). In this work, we first characterize the limitations of a decoupled frontend processor with FDIP and find that FDIP suffers from significant Branch Target Buffer (BTB) misses. We also find that existing techniques (e.g., stream prefetchers and predecoders) are unable to mitigate these misses, as they rely on an incomplete understanding of a program’s branching behavior. 

To address the shortcomings of existing BTB prefetching techniques, we propose Twig, a novel profile-guided BTB prefetching mechanism. Twig analyzes a production binary’s execution profile to identify critical BTB misses and inject BTB prefetch instructions into code. Additionally, Twig coalesces multiple non-contiguous BTB prefetches to improve the BTB’s locality. Twig exposes these techniques via new BTB prefetch instructions. Since Twig prefetches BTB entries without modifying the underlying BTB organization, it is easy to adopt in modern processors. We study Twig’s behavior across nine widely-used data center applications, and demonstrate that it achieves an average 20.86% (up to 145%) performance speedup over a baseline 8K-entry BTB, outperforming the state-of-the-art BTB prefetch mechanism by 19.82% (on average). 

Tanvir Khan, Dexin Zhang, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz, and Baris Kasikci

In Proceedings of the 48th International Symposium on Computer Architecture (ISCA 2021), June 2021.

Abstract

Modern data center applications exhibit deep software stacks, resulting in large instruction footprints that frequently cause instruction cache misses degrading performance, cost, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior, and furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced cache line evictions that are not handled by existing replacement policies. Existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application's complex program behavior.

To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects "cache line eviction" instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) over prior work due to a mean 19% (up to 28.6%) I-cache miss reduction.

Tanvir Khan, Ian Neal, Gilles Pokam, Barzan Mozafari and Baris Kasikci

In Proceedings of the 15th Symposium on Operating Systems Design and Implementation (OSDI'21), July 2021.

Abstract

Poor data locality hurts an application’s performance. While compiler-based techniques have been proposed to improve data locality, they depend on heuristics, which can sometimes hurt performance. Therefore, developers typically find data locality issues via dynamic profiling and repair them manually. Alas, existing profiling techniques incur high overhead and cannot be deployed in production, where programs may exhibit previously-unseen performance problems.

We present selective profiling, a technique that locates data locality problems with overhead low enough to be suitable for production use. To achieve low overhead, selective profiling gathers runtime execution information selectively and incrementally. Using selective profiling, we build DMon, a system that can automatically locate data locality problems in production, identify access patterns that hurt locality, and repair such patterns using targeted optimizations.

Thanks to selective profiling, DMon’s profiling overhead is 1.36% on average, making it feasible for production use. DMon’s targeted optimizations provide 16.83% speedup on average (up to 53.14%), compared to a baseline that uses the highest level of compiler optimization. DMon speeds up PostgreSQL, one of the most popular database systems, by 6.64% on average (up to 17.48%).
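
The example below illustrates one common kind of targeted locality repair a profiler like this can point developers toward: splitting cold fields out of a hot record so the hot loop touches fewer cache lines. The struct and field names are hypothetical and are not taken from DMon's evaluation.

```cpp
// Illustrative only: one common data-locality repair is splitting cold,
// rarely-accessed fields out of a hot record so the hot loop streams
// through densely packed data. Field names are hypothetical; this shows
// the *kind* of targeted fix a locality profiler can suggest, not a
// transformation taken from DMon itself.
#include <cstdint>
#include <string>
#include <vector>

// Before: every element drags a large, rarely-used string through the
// cache even though the hot loop only needs `key` and `count`.
struct RecordBefore {
    uint64_t key;
    uint64_t count;
    std::string debug_name;   // cold field, rarely read
};

// After: hot fields packed densely; cold data kept in a parallel array.
struct RecordHot { uint64_t key; uint64_t count; };

uint64_t total(const std::vector<RecordHot>& hot) {
    uint64_t sum = 0;
    for (const auto& r : hot) sum += r.count;   // streams through dense data
    return sum;
}

int main() {
    std::vector<RecordHot> hot = {{1, 10}, {2, 20}};
    return total(hot) == 30 ? 0 : 1;
}
```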

[2020]

Tanvir Khan, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz, and Baris Kasikci

In Proceedings of the 53rd International Symposium on Microarchitecture (MICRO-53), Athens, Greece, Oct. 2020.

Abstract

Modern data center applications have rapidly expanding instruction footprints that lead to frequent instruction cache misses, increasing cost and degrading data center performance and energy efficiency. Mitigating instruction cache misses is challenging since existing techniques (1) require significant hardware modifications, (2) expect impractical on-chip storage, or (3) prefetch instructions based on inaccurate understanding of program miss behavior. 

To overcome these limitations, we first investigate the challenges of effective instruction prefetching. We then use insights derived from our investigation to develop I-SPY, a novel profile-driven prefetching technique. I-SPY uses dynamic miss profiles to drive an offline analysis of I-cache miss behavior, which it uses to inform prefetching decisions. Two key techniques underlie I-SPY’s design: (1) conditional prefetching, which only prefetches instructions if the program context is known to lead to misses, and (2) prefetch coalescing, which merges multiple prefetches of non-contiguous cache lines into a single prefetch instruction. I-SPY exposes these techniques via a family of light-weight hardware code prefetch instructions. 

We study I-SPY in the context of nine data center applications and show that it provides an average of 15.5% (up to 45.9%) speedup and 95.9% (and up to 98.4%) reduction in instruction cache misses, outperforming the state-of-the-art prefetching technique by 22.5%. We show that I-SPY achieves performance improvements that are on average 90.5% of the performance of an ideal cache with no misses. 

Samira Mirbagher, Elba Garza, Gilles Pokam, and Daniel Jiménez

In Proceedings of the 53rd International Symposium on Microarchitecture (MICRO-53), Athens, Greece, Oct. 2020.

Abstract

Translation Lookaside Buffers (TLBs) play a critical role in hardware-supported memory virtualization. To speed up address translation and reduce costly page table walks, TLBs cache a small number of recently-used virtual-to-physical address translations. TLBs must make the best use of their limited capacities. Thus, TLB entries with low potential for reuse should be replaced by more useful entries. This paper contributes to an aspect of TLB management that has received little attention in the literature: replacement policy. We show how predictive replacement policies can be tailored toward TLBs to reduce miss rates and improve overall performance. 

We begin by applying recently proposed predictive cache replacement policies to the TLB. We show these policies do not work well without considering specific TLB behavior. 

Next, we introduce a novel TLB-focused predictive policy, Control-flow History Reuse Prediction (CHiRP). This policy uses a history signature and replacement algorithm that correlates to known TLB behavior, outperforming other policies. 

For a 1024-entry 8-way set-associative L2 TLB with a 4KB page size, we show that CHiRP reduces misses per 1000 instructions (MPKI) by an average 28.21% over the least-recently-used (LRU) policy, outperforming Static Re-reference Interval Prediction (SRRIP) [44], Global History Reuse Policy (GHRP) [73] and SHiP [61], which reduce MPKI by an average of 10.36%, 9.03% and 0.88%, respectively. 

Samira Mirbagher, Gilles Pokam, Esmaeil Koruyeh, Elba Garza, Nael Abu-Ghazaleh, and Daniel Jiménez 

In Proceedings of the 53rd International Symposium on Microarchitecture (MICRO-53), Athens, Greece, Oct. 2020.

Abstract

Detecting microarchitectural attacks is critical given their proliferation in recent years. Many of these attacks exhibit intrinsic behaviors essential to the nature of their operation, such as creating contention or misspeculation. This study systematically investigates the microarchitectural footprints of hardware-based attacks and shows how they can be detected and classified using an efficient hardware predictor. We present a methodology to use correlated microarchitectural statistics to design a hardware-based neural predictor capable of detecting and classifying microarchitectural attacks before data is leaked. Once a potential attack is detected, it can be proactively mitigated by triggering appropriate countermeasures. Our hardware-based detector, PerSpectron, uses perceptron learning to identify and classify attacks. Perceptron-based prediction has been successfully used in branch prediction and other hardware-based applications. PerSpectron has minimal performance overhead. The statistics being monitored have similar overhead to already existing performance monitoring counters. Additionally, PerSpectron operates outside the processor’s critical paths, offering security without added computation delay. Our system achieves a usable detection rate for detecting attacks such as SpectreV1, SpectreV2, SpectreRSB, CacheOut, Meltdown, breakingKSLR, Flush+Flush, Flush+Reload, Prime+Probe as well as cache-attack calibration programs. We also believe that the large number of diverse microarchitectural features offers both evasion resilience and interpretability, features not present in previous hardware security detectors. We detect these attacks early enough to avoid any data leakage, unlike previous work that triggers countermeasures only after data has been exposed. 
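
The following is a minimal software sketch of perceptron-style classification over a vector of microarchitectural statistics, the general learning rule referenced above. The feature count, threshold, and update rule are textbook simplifications; the actual design is a hardware predictor with its own feature engineering.

```cpp
// Minimal software sketch of perceptron-style detection over a vector of
// microarchitectural statistics. The feature count, threshold, and update
// rule below are textbook perceptron choices used for illustration; the
// real system is a hardware predictor with its own feature selection.
#include <array>

constexpr int kFeatures = 8;          // hypothetical number of counters

struct Perceptron {
    std::array<int, kFeatures> w{};   // learned weights
    int bias = 0;

    int score(const std::array<int, kFeatures>& x) const {
        int s = bias;
        for (int i = 0; i < kFeatures; ++i) s += w[i] * x[i];
        return s;
    }
    // Classic perceptron update: nudge weights toward the correct label
    // (+1 = attack-like, -1 = benign) when the prediction is weak or wrong.
    void train(const std::array<int, kFeatures>& x, int label, int theta = 32) {
        int s = score(x);
        bool predicted_attack = s >= 0;
        if ((predicted_attack != (label > 0)) || (s < theta && s > -theta)) {
            for (int i = 0; i < kFeatures; ++i) w[i] += label * x[i];
            bias += label;
        }
    }
    bool is_attack(const std::array<int, kFeatures>& x) const {
        return score(x) >= 0;
    }
};

int main() {
    Perceptron p;
    std::array<int, kFeatures> sample{};   // e.g., normalized counter deltas
    p.train(sample, /*label=*/-1);         // benign training sample
    return p.is_attack(sample) ? 1 : 0;
}
```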

Christian DeLozier, Kavya Lakshminarayanan, Gilles Pokam, and Joseph Devietti. 

In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2020), Lausanne, Switzerland, March 2020.

Abstract

Code-reuse attacks represent the state-of-the-art in exploiting memory safety vulnerabilities. Control-flow integrity techniques offer a promising direction for preventing code-reuse attacks, but these attacks are resilient against imprecise and heuristic-based detection and prevention mechanisms. 

In this work, we propose a new context-sensitive control-flow integrity system (Hurdle) that guarantees pairwise gadgets cannot be chained in a code-reuse attack. Hurdle improves upon prior techniques by using SMT constraint solving to ensure that indirect control flow transfers cannot be maliciously redirected to execute gadget chains. At the same time, Hurdle’s security policy is flexible enough that benign executions are only rarely mischaracterized as malicious. When such mischaracterizations occur, Hurdle can generalize its constraint solving to avoid these mischaracterizations at low marginal cost. 

We propose architecture extensions for Hurdle which consist of an extended branch history register and new instructions. Thanks to its hardware support, Hurdle enforces a context-sensitive control-flow integrity policy with 1.02% average runtime overhead. 

[2019]

Tanvir Ahmed Khan, Yifan Zhao, Gilles Pokam, Barzan Mozafari and Baris Kasikci. 

In Proceedings of the 40th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019), Phoenix, AZ, USA, June 2019.

Abstract

Writing efficient multithreaded code that can leverage the full parallelism of underlying hardware is difficult. A key impediment is insidious cache contention issues, such as false sharing. False sharing occurs when multiple threads from different cores access disjoint portions of the same cache line, causing it to go back and forth between the caches of different cores and leading to substantial slowdown. 

Alas, existing techniques for detecting and repairing false sharing have limitations. On the one hand, in-house (i.e., offline) techniques are limited to situations where falsely shared data can be determined statically, and are otherwise inaccurate. On the other hand, in-production (i.e., run-time) techniques incur considerable overhead, as they constantly monitor a program to detect false sharing. In-production repair techniques are also limited by the types of modifications they can perform on the fly, and are therefore less effective. 

We present Huron, a hybrid in-house/in-production false sharing detection and repair system. Huron detects and repairs as much false sharing as it can in-house, and relies on its lightweight in-production mechanism for remaining cases. The key idea behind Huron’s in-house false sharing repair is to group together data that is accessed by the same set of threads, to shift falsely-shared data to different cache lines. Huron’s in-house repair technique can generalize to previously-unobserved inputs. Our evaluation shows that Huron can detect more false sharing bugs than all state-of-the-art techniques, and with a lower overhead. Huron improves runtime performance by 3.82× on average (up to 11×), which is 2.11-2.27× better than the state of the art. 
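
For context, the self-contained example below shows false sharing and the standard layout repair of giving each thread's data its own cache line, which is the kind of transformation the abstract describes Huron applying automatically; the workload and constants are illustrative only.

```cpp
// Self-contained illustration of false sharing and the standard repair:
// two threads increment "their own" counters, but if the counters share a
// cache line, the line ping-pongs between cores. Alignment places each
// counter on its own line. This shows the general repair described above;
// Huron derives an equivalent layout change automatically.
#include <atomic>
#include <thread>

struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};              // same cache line as `a`: false sharing
};

struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};  // each counter gets its own line
};

template <typename T>
void hammer(T& s) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) s.a++; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) s.b++; });
    t1.join();
    t2.join();
}

int main() {
    Shared contended;
    Padded isolated;
    hammer(contended);   // typically much slower due to line ping-pong
    hammer(isolated);
}
```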

[2018]

Animesh Jain, Michael Laurenzano, Gilles Pokam, Jason Mars, and Lingjia Tang. 

In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT 2018), Limassol, Cyprus, Nov. 2018.

Abstract

A key focus of recent work in our community has been on devising increasingly sophisticated acceleration devices for deep neural network (DNN) computation, especially for networks driven by convolution layers. Yet, despite the promise of substantial improvements in performance and energy consumption offered by these approaches, general purpose computing is not going away because of its traditional, well-understood programming model and continued wide deployment. Therefore, the question arises as to what can be done, if anything, to evolve conventional CPUs to accommodate efficient deep neural network computation.

This work focuses on the challenging problem of identifying and alleviating the performance bottlenecks of convolution layer computation for conventional CPU platforms. We begin by performing a detailed study of a range of CNN-based applications on a modern CPU microarchitecture, finding that designing a physical register file (PRF) capable of feeding the computational units is the primary barrier that prevents the addition of more compute units in the CPU, limiting the performance improvements that the CPU can achieve on convolution layers. We present the design of a novel, minimally intrusive set of microarchitectural and ISA extensions that address this problem, and describe the code generation support needed to take advantage of our design. Through a detailed evaluation that covers 5 state-of-the-art neural network applications, we observe that applying these extensions allows packing more compute into the CPU while keeping PRF energy in check, achieving a 2× performance improvement and a 2.7× energy-delay product improvement over a popular Intel Haswell server processor baseline.

Steven Margerm, Amirali Sharifian, Apala Guha, Gilles Pokam and Arrvindh Shriraman.

In Proceedings of the 51st International Symposium on Microarchitecture (MICRO-51), Fukuoka, Japan, Oct. 2018.

Abstract

High-level-synthesis (HLS) tools generate accelerators from software programs to ease the task of building hardware. Unfortunately, current HLS tools have limited support for concurrency, which impacts the speedup achievable with the generated accelerator. Current approaches only target fixed static patterns (e.g., pipeline, data-parallel kernels). This constrains the ability of software programmers to express concurrency. Moreover, the generated accelerator loses a key benefit of parallel hardware, dynamic asynchrony, and the potential to hide long latencies and cache misses.

We have developed TAPAS, an HLS toolchain for generating parallel accelerators from programs with dynamic parallelism. TAPAS is built on top of Tapir [22], [39], which embeds fork-join parallelism into the compiler’s intermediate-representation. TAPAS leverages the compiler IR to identify parallelism and synthesizes the hardware logic. TAPAS provides first-class architecture support for spawning, coordinating and synchronizing tasks during accelerator execution. We demonstrate TAPAS can generate accelerators for concurrent programs with heterogeneous, nested and recursive parallelism. Our evaluation on Intel-Altera DE1-SoC and Arria-10 boards demonstrates that TAPAS generated accelerators achieve 20× the power efficiency of an Intel Xeon, while maintaining comparable performance. We also show that TAPAS enables lightweight tasks that can be spawned in 10 cycles and enables accelerators to exploit available fine-grain parallelism. TAPAS is a complete HLS toolchain for synthesizing parallel programs to accelerators and is open-sourced. 

[2017]

Christian DeLozier, Ariel Eizenberg, Shiliang Hu, Gilles Pokam and Joseph Devietti. 

In Proceedings of the 50th International Symposium on Microarchitecture (MICRO-50), Boston, USA, Oct. 2017.

Abstract

Cache contention in the form of false sharing and true sharing arises when threads overshare cache lines at high frequency. Such oversharing can reduce or negate the performance benefits of parallel execution. Prior systems for detecting and repairing cache contention lack efficiency in detection or repair, contain subtle memory consistency flaws, or require invasive changes to the program environment. 

In this paper, we introduce a new way to combat cache line oversharing via the Thread Memory Isolation (TMI) system. TMI operates completely in user space, leveraging performance counters and the Linux ptrace mechanism to tread lightly on monitored applications, intervening only when necessary. TMI’s compatible-by-default design allows it to scale to real-world workloads, unlike previous proposals. TMI introduces a novel code-centric consistency model to handle cross-language memory consistency issues. TMI exploits the flexibility of code-centric consistency to efficiently repair false sharing while preserving strong consistency model semantics when necessary. 

TMI has minimal impact on programs without oversharing, slowing their execution by just 2% on average. We also evaluate TMI on benchmarks with known false sharing, and manually inject a false sharing bug into the leveldb key-value store from Google. For these programs, TMI provides an average speedup of 5.2x and achieves 88% of the speedup possible with manual source code fixes. 

Florian Haas, Sebastian Weis, Theo Ungerer, Gilles Pokam and Youfeng Wu.

In Proceedings of the 30th International Conference on Architecture of Computing Systems (ARCS 2017), Vienna, Austria, April 2017.

Abstract

The demand for fault-tolerant execution on high performance computer systems increases due to higher fault rates resulting from smaller structure sizes. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial off-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a software/hardware hybrid approach, which targets Intel’s current x86 multi-core platforms of the Core and Xeon family. We leverage hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback. Redundant execution of processes and signature-based comparison of their computations provides error detection, and transactional wrapping enables error recovery. Existing applications are enhanced towards fault-tolerant redundant execution by post-link binary instrumentation. Hardware enhancements to further increase the applicability of the approach are proposed and evaluated with SPEC CPU 2006 benchmarks. The resulting performance overhead is 47% on average, assuming the existence of the proposed hardware support.
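
A minimal sketch of the checkpoint/rollback building block described above follows, assuming TSX-capable hardware and a compiler exposing the RTM intrinsics in <immintrin.h> (built with -mrtm). The redundant execution and signature comparison from the paper are elided, and the error check and retry policy are placeholders.

```cpp
// Minimal sketch of the checkpoint/rollback primitive: wrap a region in an
// Intel TSX (RTM) transaction so its side effects can be discarded and the
// region retried. The error check and retry count are placeholders.
#include <immintrin.h>
#include <cstdio>

bool run_checked(long& x, bool simulate_error) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            x += 1;                        // speculative work inside the txn
            if (simulate_error)
                _xabort(0xff);             // detected fault: roll back updates
            _xend();                       // commit: make updates visible
            return true;
        }
        // Aborted (conflict, capacity, or explicit _xabort): all speculative
        // updates were discarded; fall through and retry.
    }
    return false;                           // give up after a few attempts
}

int main() {
    long counter = 0;
    bool ok = run_checked(counter, /*simulate_error=*/false);
    std::printf("ok=%d counter=%ld\n", ok ? 1 : 0, counter);
}
```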

Chun-Hung Hsiao, Satish Narayanasamy, Essam Muhammad Idris, Cristiano Pereira and Gilles Pokam. 

In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017), Xi'an, China, April 2017.

Abstract

The asynchronous programming model is commonly used in mobile systems and Web 2.0 environments. Asynchronous race detectors use algorithms that are an order of magnitude less efficient in performance and space than conventional data race detectors. We solve this problem by identifying and addressing two important problems in reasoning about causality between asynchronous events.

Unlike conventional signal-wait operations, establishing causal order between two asynchronous events is fundamentally more challenging as there is no common handle they operate on. We propose a new primitive named ASYNCCLOCK that addresses this problem by explicitly tracking causally preceding events, and show that ASYNCCLOCK can handle a wide variety of asynchronous causality models. We also address the important scalability problem of efficiently identifying heirless events whose metadata can be reclaimed. 

We built the first single-pass, non-graph-based Android race detector using our algorithm and applied it to find errors in 20 popular applications. Our tool incurs about 6x performance overhead, which is several times more efficient than the state-of-the-art solution. It also scales well with the execution length. We used our tool to find 147 previously unknown harmful races. 
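
The toy sketch below conveys the flavor of explicitly tracking causally preceding asynchronous events so a checker can query happens-before; it is a simplified stand-in for illustration, not the ASYNCCLOCK data structure or algorithm.

```cpp
// Toy illustration of tracking causality between asynchronous events: each
// posted event records the set of events known to causally precede it, so a
// race checker can ask "did e1 happen before e2?". Simplified stand-in only.
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

class CausalityGraph {
public:
    // `posted_by` is the event whose handler posted this new event (0 = none).
    void post(uint64_t event, uint64_t posted_by) {
        auto& preds = pred_[event];
        if (posted_by != 0) {
            preds.insert(posted_by);
            const auto& earlier = pred_[posted_by];
            preds.insert(earlier.begin(), earlier.end());  // transitive closure
        }
    }
    bool happens_before(uint64_t a, uint64_t b) const {
        auto it = pred_.find(b);
        return it != pred_.end() && it->second.count(a) > 0;
    }
private:
    std::unordered_map<uint64_t, std::unordered_set<uint64_t>> pred_;
};

int main() {
    CausalityGraph g;
    g.post(/*event=*/1, /*posted_by=*/0);   // e1: user input handled
    g.post(/*event=*/2, /*posted_by=*/1);   // e2 posted from e1's handler
    return g.happens_before(1, 2) ? 0 : 1;  // e1 causally precedes e2
}
```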

[2016] 

Ariel Eizenberg, Shiliang Hu, Gilles Pokam and Joseph Devietti. 

In Proceedings of the 37th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2016), Santa Barbara, CA, USA, June 2016.

Abstract

As ever more computation shifts onto multicore architectures, it is increasingly critical to find effective ways of dealing with multithreaded performance bugs like true and false sharing. Previous approaches to fixing false sharing in unmanaged languages have employed highly-invasive runtime program modifications. We observe that managed language runtimes, with garbage collection and JIT code compilation, present unique opportunities to repair such bugs directly, mirroring the techniques used in manual repairs. 

We present REMIX, a modified version of the Oracle HotSpot JVM which can detect cache contention bugs and repair false sharing at runtime. REMIX’s detection mechanism leverages recent performance counter improvements on Intel platforms, which allow for precise, unobtrusive monitoring of cache contention at the hardware level. REMIX can detect and repair known false sharing issues in the LMAX Disruptor high-performance inter-thread messaging library and the Spring Reactor event-processing framework, automatically providing 1.5-2x speedups over unoptimized code and matching the performance of hand optimization. REMIX also finds a new false sharing bug in SPECjvm2008, and uncovers a true sharing bug in the HotSpot JVM that, when fixed, improves the performance of three NAS Parallel Benchmarks by 7-25x. REMIX incurs no statistically-significant performance overhead on other benchmarks that do not exhibit cache contention, making REMIX practical for always-on use. 

Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Gilles Pokam, Chris Newburn and Joseph Devietti.

In Proceedings of the 22nd IEEE Symposium on High Performance Computer Architecture (HPCA 2016), Barcelona, Spain, March 2016.

Abstract

Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program’s actual sharing behavior, and can even arise invisibly in the program due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment. 

This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel’s Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware. 

[2015]            

Yujie Liu, Justin Gottschlich, Gilles Pokam, and Michael Spear. 

In Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT 2015), San Francisco, CA, USA, October 2015.

Abstract

The availability of commercial hardware transactional memory (TM) systems has not yet been met with a rise in the number of large-scale programs that use memory transactions explicitly. A significant impediment to the use of TM is the lack of tool support, specifically profilers that can identify and explain performance anomalies. In this paper, we introduce an end-to-end system that enables low-overhead performance profiling of large-scale transactional programs. We present algorithms and an implementation for Intel’s Haswell processors. With our system, it is possible to record a transactional program’s execution with minimal overhead, and then replay it within a custom profiling tool to identify causes of contention and aborts, down to the granularity of individual memory accesses. Evaluation shows that our algorithms have low overhead, and our tools enable programmers to effectively explain performance anomalies.

Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam and George Candea.

In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP'15), Monterey, CA, October 2015.

Abstract

Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice. 

We propose failure sketching, an automated debugging technique that provides developers with an explanation (“failure sketch”) of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis. 

We built Gist, a prototype for failure sketching that relies on hardware watch-points and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached. 

Baris Kasikci, Cristiano Pereira, Gilles Pokam, Benjamin Schubert and George Candea.

In 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS XV), Kartause Ittingen, Switzerland, May 2015

Abstract

One of the main reasons debugging is hard and time consuming is that existing debugging tools do not provide an explanation for the root causes of failures. Additionally, existing techniques either rely on expensive runtime recording or assume existence of a given program input that reliably reproduces the failure, which makes them hard to apply in production scenarios. Consequently, developers spend precious time chasing elusive bugs, resulting in productivity loss.

We propose a new debugging technique, called failure sketching, that provides the developer with a high-level explanation for the root cause of a failure. A failure sketch achieves this goal because: 1) it only contains program statements that cause a failure; 2) it shows which program properties differ between failing and successful executions. We argue that failure sketches can be built by combining in-house static analysis and crowdsourced dynamic analysis. For building a failure sketch, we do not assume that developers can reproduce the failure. We show preliminary evidence that failure sketches can significantly improve programmer productivity.

[2014]

Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, and Maurice Herlihy. 

In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques (PACT 2014), Alberta, Canada, August 2014

Abstract

The Intel Haswell processor includes restricted transactional memory (RTM), which is the first commodity-based hardware transactional memory (HTM) to become publicly available. However, like other real HTMs, such as IBM’s Blue Gene/Q, Haswell’s RTM is best-effort, meaning it provides no transactional forward progress guarantees. Because of this, a software fallback system must be used in conjunction with Haswell’s RTM to ensure transactional programs execute to completion. To complicate matters, Haswell does not provide escape actions. Without escape actions, nontransactional instructions cannot be executed within the context of a hardware transaction, thereby restricting the ways in which a software fallback can interact with the HTM. As such, the challenge of creating a scalable hybrid TM (HyTM) that uses Haswell’s RTM and a software TM (STM) fallback is exacerbated. 

In this paper, we present Invyswell, a novel HyTM that exploits the benefits and manages the limitations of Haswell’s RTM. After describing Invyswell’s design, we show that it outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid NOrec, NOrec’s hybrid implementation, by 18%, and Haswell’s hardware-only lock elision by 25% across all STAMP benchmarks.
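
As background, the sketch below shows the baseline best-effort-HTM pattern the paper starts from: run the critical section in an RTM transaction that subscribes to a single global lock, and fall back to acquiring that lock after repeated aborts. Invyswell layers an STM fallback and further machinery on top of this; the retry count and lock used here are arbitrary illustrative choices.

```cpp
// Minimal sketch of the baseline best-effort HTM pattern: execute the
// critical section as an RTM transaction that subscribes to a single global
// lock (SGL), and fall back to acquiring the SGL in software after repeated
// aborts. Requires TSX hardware and -mrtm; retry counts are arbitrary.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> sgl{false};   // single global fallback lock

template <typename F>
void run_transaction(F&& critical_section) {
    for (int attempt = 0; attempt < 4; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (sgl.load(std::memory_order_relaxed)) {
                _xabort(0x01);          // subscribe: abort if SGL is held
            }
            critical_section();
            _xend();
            return;
        }
        // Aborted: wait for the SGL to be released, then retry in hardware.
        while (sgl.load(std::memory_order_acquire)) { /* spin */ }
    }
    // Software fallback: take the global lock and run non-speculatively.
    bool expected = false;
    while (!sgl.compare_exchange_weak(expected, true,
                                      std::memory_order_acquire)) {
        expected = false;
    }
    critical_section();
    sgl.store(false, std::memory_order_release);
}

long shared_counter = 0;

int main() {
    run_transaction([] { shared_counter += 1; });
    return shared_counter == 1 ? 0 : 1;
}
```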

Chun-Hung Hsiao, Jie Yu, Satish Narayanasamy, Ziyun Kong, Cristiano Pereira, Gilles Pokam, Peter Chen, and Jason Flinn. 

In Proceedings of the 35th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2014), Edinburgh, UK, June 2014

Abstract

Mobile systems commonly support an event-based model of concurrent programming. This model, used in popular platforms such as Android, naturally supports mobile devices that have a rich array of sensors and user input modalities. Unfortunately, most existing tools for detecting concurrency errors of parallel programs focus on a thread-based model of concurrency. If one applies such tools directly to an event-based program, they work poorly because they infer false dependencies between unrelated events handled sequentially by the same thread. 

In this paper we present a race detection tool named CAFA for event-driven mobile systems. CAFA uses the causality model that we have developed for the Android event-driven system. A novel contribution of our model is that it accounts for the causal order due to the event queues, which are not accounted for in past data race detectors. Detecting races based on low-level races between memory accesses leads to a large number of false positives. CAFA overcomes this problem by checking for races between high-level operations. We discuss our experience in using CAFA for finding and understanding a number of known and unknown harmful races in open-source Android applications. 

Irina Calciu, Tatiana Shpeisman, Gilles Pokam and Maurice Herlihy. 

In Proceedings of the 9th ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2014), Salt Lake City, Utah, USA, March 2014.

Abstract

Intel’s Haswell and IBM’s Blue Gene/Q and System Z are the first commercially available systems to include hardware transactional memory (HTM). However, they are all best-effort, meaning that every hardware transaction must have an alternative software fallback path that guarantees forward progress. The simplest and most widely used software fallback is a single global lock (SGL), in which aborted hardware transactions acquire the SGL before they are re-executed in software. Other hardware transactions need to subscribe to this lock and abort as soon as it is acquired. This approach, however, causes many hardware transactions to abort unnecessarily, forcing even more transactions to fail and resort to the SGL.

In this paper we suggest improvements to the simple SGL fallback. First, we use lazy subscription to reduce the rate of SGL acquisitions. Next, we propose fine-grained conflict detection mechanisms between hardware transactions and a software SGL transaction. Finally, we describe how our findings can be used to improve future generations of HTMs. 
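
Continuing the earlier HTM fallback sketch, the variant below illustrates lazy subscription: the SGL is read only immediately before commit, shrinking the window in which a lock acquisition aborts the transaction. It is a simplified illustration of the idea in the abstract, not the paper's full mechanism.

```cpp
// Lazy-subscription variant of the SGL fallback sketch: the global lock is
// checked only just before _xend(), so the transaction is vulnerable to an
// SGL acquisition for a much shorter window. Simplified for illustration;
// the paper also proposes finer-grained conflict detection with the SGL
// holder. Requires TSX hardware and -mrtm.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> sgl{false};   // plays the role of the single global lock

template <typename F>
bool try_transaction_lazy(F&& critical_section) {
    unsigned status = _xbegin();
    if (status != _XBEGIN_STARTED) return false;   // caller falls back to SGL
    critical_section();                            // run without checking SGL
    if (sgl.load(std::memory_order_relaxed)) {
        _xabort(0x02);          // subscribe late: abort only if SGL is held now
    }
    _xend();
    return true;
}

int main() {
    long x = 0;
    if (!try_transaction_lazy([&] { x += 1; })) {
        x += 1;   // a real caller would acquire the SGL here instead
    }
    return x == 1 ? 0 : 1;
}
```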

[2013]

Justin Gottschlich, Gilles Pokam, Cristiano Pereira and Youfeng Wu. 

In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT 2013), Edinburgh, Scotland, Sept 2013.

Abstract

To reduce the complexity of debugging multithreaded programs, researchers have developed many techniques that automatically detect bugs that arise from shared memory errors. These techniques can identify a wide range of bugs, but it can be challenging for a programmer to use them to reproduce a specific bug that he or she is interested in. This is because these techniques were not intended for individual bug reproduction but rather for an exploratory search for possible bugs.

To address this concern we present concurrent predicates (CPs) and concurrent predicate expressions (CPEs), which allow programmers to single out a specific bug by specifying the schedule and program state that must be satisfied for the bug to be reproduced. We present the recipes, that is, the mechanical processes, we use to reproduce data races, atomicity violations, and deadlocks with CP and CPE. We then show how these recipes apply to the diagnosis and reproduction of bugs from 13 handcrafted bugs, five real-world application bugs from RADBench, and three previously unresolved bugs from TBoost.STM, which now includes the fixes we generated using CP and CPE. 
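
The fragment below gives only the flavor of a concurrent predicate: a thread waits at an annotated point until a programmer-specified condition over shared state holds, steering execution toward the interleaving needed to reproduce a bug. The helper name and busy-wait mechanism are hypothetical; CP/CPE provide richer expressions over both schedule and program state.

```cpp
// Toy sketch of the flavor of a concurrent predicate: a thread pauses at an
// annotated point until a condition over shared state holds, so the buggy
// interleaving is reproduced deterministically. The helper and timeout are
// hypothetical illustrations, not the CP/CPE mechanism itself.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: block the calling thread until `pred()` becomes true
// or a timeout expires (e.g., "the other thread has freed the buffer").
inline bool wait_for_predicate(const std::function<bool()>& pred,
                               std::chrono::milliseconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!pred()) {
        if (std::chrono::steady_clock::now() > deadline) return false;
        std::this_thread::yield();
    }
    return true;
}

std::atomic<bool> buffer_freed{false};

void thread_a() {
    // CP: only proceed once the racing free has happened, so the
    // use-after-free interleaving is forced rather than left to chance.
    wait_for_predicate([] { return buffer_freed.load(); },
                       std::chrono::milliseconds(1000));
    // ... access the buffer here (the bug now manifests) ...
}

void thread_b() { buffer_freed.store(true); }   // the racing free

int main() {
    std::thread tb(thread_b), ta(thread_a);
    ta.join();
    tb.join();
}
```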

Milos Gligoric, Lingming Zhang, Cristiano Pereira and Gilles Pokam. 

In International Symposium on Software Testing and Analysis (ISSTA 2013), Lugano, Switzerland, July 2013.

Abstract

Concurrent code is becoming increasingly important with the advent of multi-cores, but testing concurrent code is challenging. Researchers are developing new testing techniques and test suites for concurrent code, but evaluating these techniques and test suites often uses a small number of real or manually seeded bugs. 

Mutation testing allows creating a large number of buggy programs to evaluate test suites. However, performing mutation testing is expensive even for sequential code, and the cost is higher for concurrent code where each test has to be executed for many (possibly all) thread schedules. The most widely used technique to speed up mutation testing is selective mutation, which reduces the number of mutants by applying only a subset of mutation operators such that test suites that kill all mutants generated by this subset also kill (almost) all mutants generated by all mutation operators. To date, selective mutation has been used only for sequential mutation operators. 

This paper explores selective mutation for concurrent mutation operators. Our results identify several sets of concurrent mutation operators that can effectively reduce the number of mutants, show that operator-based selection is slightly better than random mutant selection, and show that sequential and concurrent mutation operators are independent, demonstrating the importance of studying concurrent mutation operators. 

Justin Gottschlich, Rob Knauerhase and Gilles Pokam. 

In 5th USENIX Workshop on Hot Topics in Parallelism (HotPar 2013), San Jose, CA, USA, June 2013.

Abstract

With recent announcements of hardware transactional memory (HTM) systems from IBM and Intel, HTM will soon be available for wide-scale adoption. Such platforms, combined with tested and stable software transactional memory systems, are likely to make real transactional memory (TM) systems available for the first time, which promises to be a more attractive alternative than lock-based parallel programming in terms of programmability and performance. 

With these first-ever real systems come several open questions. Perhaps one of the most obvious is, “how does one debug a TM program that uses real hardware?” While prior research in this area exists, there are, to the best of our knowledge, no commercially-available TM debuggers and only a handful of research projects exploring such possibilities, many of which use simulated HTMs that may utilize unrealistic hardware. In this paper, we motivate the need for more than traditional and ad hoc debugging support. We then propose a novel record-and-replay system, based on existing research prototypes and demonstrate its usefulness by reviewing several use cases in TM programming, restricting use to real features in IBM and Intel’s HTMs.

Gilles Pokam, Klaus Danne, Cristiano Pereira, Rolf Kassa, Tim Kranich, Shiliang Hu, Justin Gottschlich, Nima Honarmand, Nathan Dautenhahn, Samuel King and Josep Torrellas 

In Proceedings of the 40th International Symposium on Computer Architecture (ISCA 2013), Tel-Aviv, Israel, June 2013.

Abstract

There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. 

This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. 

This paper’s focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR. 

Nima Honarmand, Nathan Dautenhahn, Samuel King, Josep Torrellas, Gilles Pokam and Cristiano Pereira. 

In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013), Houston, Texas, USA, March 2013.

Abstract

Architectures for deterministic record-replay (R&R) of multithreaded code are attractive for program debugging, intrusion analysis, and fault-tolerance uses. However, very few of the proposed designs have focused on maximizing replay speed — a key enabling property of these systems. The few efforts that focus on replay speed require intrusive hardware or software modifications, or target whole-system R&R rather than the more useful application-level R&R. 

This paper presents the first hardware-based scheme for unintrusive, application-level R&R that explicitly targets high replay speed. Our scheme, called Cyrus, requires no modification to commodity snoopy cache coherence. It introduces the concept of an on-the-fly software Backend Pass during recording which, as the log is being generated, transforms it for high replay parallelism. This pass also fixes-up the log, and can flexibly trade-off replay parallelism for log size. We analyze the performance of Cyrus using full system (OS plus hardware) simulation. Our results show that Cyrus has negligible recording overhead. In addition, for 8-processor runs of SPLASH-2, Cyrus attains an average replay parallelism of 5, and a replay speed that is, on average, only about 50% lower than the recording speed. 

[2012]

Jie Yu, Satish Narayanasamy, Cristiano Pereira and Gilles Pokam. 

In Proceedings of the 27th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA 2012), Tucson, AZ, USA, October 2012.

Abstract

Testing multithreaded programs is a hard problem, because it is challenging to expose those rare interleavings that can trigger a concurrency bug. We propose a new thread interleaving coverage-driven testing tool called Maple that seeks to expose untested thread interleavings as much as possible. It memoizes tested interleavings and actively seeks to expose untested interleavings for a given test input to increase interleaving coverage. 

We discuss several solutions to realize the above goal. First, we discuss a coverage metric based on a set of interleaving idioms. Second, we discuss an online technique to predict untested interleavings that can potentially be exposed for a given test input. Finally, the predicted untested interleavings are exposed by actively controlling the thread schedule while executing for the test input. We discuss our experiences in using the tool to expose several known and unknown bugs in real-world applications such as Apache and MySQL. 

Justin Gottschlich, Maurice Herlihy, Gilles Pokam and Jeremy Siek. 

In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT 2012), Minneapolis, MN, USA, Sept 2012.

Abstract

This paper presents TMProf, a transactional memory (TM) profiler, based on three visualization principles. These principles are (i) the precise graphical representation of transaction interactions including cross-correlated information and source code, (ii) visualized soft real-time playback of concurrently executing transactions, and (iii) dynamic visualizations of multiple executions. We describe how these principles break new ground and create new challenges for TM profilers. 

We discuss our experience using TMProf with InvalSTM, a state-of-the-art software TM, and show how TMProf’s feedback led to the design of two new contention managers (CMs). We demonstrate the performance benefits of these CMs, which generally led to improved performance as the amount of work and threads increase per benchmark. Our experimental results show that iBalanced, one of our newly designed CMs, can increase transaction throughput by nearly 10× over iFair, InvalSTM’s previously best performing CM. 

Justin Gottschlich, Gilles Pokam and Cristiano Pereira. 

In 4th USENIX Workshop on Hot Topics in Parallelism (HotPar 2012), Berkeley, CA, USA, June 2012.

Abstract

To reduce the complexity of debugging multithreaded programs, researchers have developed compile- and run-time techniques that automatically detect concurrency bugs. These techniques can identify a wide range of shared memory errors, but are sometimes impractical because they produce many false positives making it difficult to triage and reproduce specific bugs. To address these concerns, we introduce a control structure, called concurrent predicate (CP), which allows programmers to single out a specific bug by specifying the conditions that must be satisfied for the bug to be triggered. Using bugs from a test suite of 23 programs, applications from RADBench, and TBoost.STM, we show how CP is used to diagnose and reproduce such bugs that could not otherwise be reproduced using similar techniques. 

[2011]

Gilles Pokam, Cristiano Pereira, Shiliang Hu, Ali-Reza Adl-Tabatabai, Justin Gottschlich, Jungwoo Ha, and Youfeng Wu.

In Proceedings of the 44th International Symposium on Microarchitecture (MICRO-44), Porto Alegre, Brazil, Dec. 2011.

Abstract

Shared memory multiprocessors are difficult to program because of the non-deterministic ways in which the memory operations from different threads interleave. To address this issue, many hardware-based memory race recorders have been proposed that efficiently log an ordering of the shared memory interleavings between threads for deterministic replay. These approaches are challenging to integrate into current processors because they change the cache subsystem or the coherence protocol, and they mostly support a sequentially consistent memory model. 

In this paper, we describe CoreRacer, a chunk-based memory race recorder architecture for multicore x86 TSO processors. CoreRacer does not modify the cache subsystem and yet it still integrates into the x86 TSO memory model. We show that by leveraging a specific x86 feature, the invariant timestamp, CoreRacer maintains ordering among chunks without piggybacking on cache coherence messages. We provide a detailed implementation and evaluation of CoreRacer on a cycle-accurate x86 simulator. We show that its integration cost into x86 is minimal and its overhead has negligible effect on performance. 

Nicholas Jalbert, Cristiano Pereira, Gilles Pokam and Koushik Sen. 

In 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar 2011), Berkeley, CA, USA, May 2011.

Abstract

Testing and debugging tools for concurrent programs are often validated on known bugs. To aid the development of these tools, we present the Race, Atomicity, and Deadlock Benchmark (RADBench) suite. The RADBench suite contains the full source of 10 real concurrency bugs found in large open-source software projects including Mozilla SpiderMonkey, Mozilla NSPR, Memcached, Apache Web Server, and Google Chromium Web Browser. We discuss the difficulties we have found in reproducing these bugs that must be accounted for when building testing and debugging tools. Finally, we propose an approach to reproducibility that has a number of benefits over standard deterministic replay for debugging. RADBench is open source and publicly available. 

[2010]

Rakesh Kumar, Tim Mattson, Gilles Pokam, and Rob Van Der Wijngaart. 

In Multiprocessor System-on-Chip: Hardware Design and Tool Integration, edited by Jurgen Becker and Michael Hubner. Springer Verlag. December 2010.

Abstract

The debate over shared memory versus message passing programming models has raged for decades, with fairly cogent arguments on both sides. In this paper, we revisit this debate for multi-core chips and argue that message passing programming models are often more suitable than shared memory models for satisfying the unique goals presented by the many-core era. 

The many-core era is different. The nature of programmers, the nature of applications, and the nature of the computing substrate are different for multi-core chips than the traditional parallel machines that drove the parallel programming debate in the past. Specifically, while traditional parallel computers were programmed by highly educated scientists, multi-core chips will be programmed by mainstream programmers with little or no background in parallel algorithms, optimizing software for specific parallel hardware features, or the theoretical foundations of concurrency. Hence, multi-core programming models must place a premium on productivity and must make parallel programming accessible to the typical programmer. Similarly, while the history of parallel computing is dominated by highly specialized scientific applications, multicore processors will need to run the full range of general purpose applications. This implies a drastically increased diversity in the nature of applications and an expanded range of optimization goals. This will heavily impact the choice of the programming model for multi-core chips. The programming models for multi-core architectures should also be capable of adapting to and exploiting asymmetry (by design and accident) in processing cores. We argue that the above goals are often better served by a message passing programming model than a shared memory-based programming model. 

Cristiano Pereira, Gilles Pokam, Klaus Danne, Ramesh Devarajan and Ali-Reza Adl-Tabatabai. 

In 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar 2010), Berkeley, CA, USA, June 2010.

Abstract

The trend towards multi-core processors has spurred a renewed interest in developing parallel programs. Parallel applications exhibit non-deterministic behavior across runs, primarily due to the changing interleaving of shared-memory accesses from run to run. The non-deterministic nature of these programs leads to many problems, which are described in this paper. To get a handle on the non-determinism, techniques to record and replay these programs have been proposed. These allow capturing the non-determinism and repeating the execution of a program, hence enabling a better understanding of, and the elimination of, problems due to non-determinism. In this paper, we argue that an efficient record and replay (R&R) system must be assisted by hardware support. However, adding support for R&R in hardware requires strong justification in terms of added value to the processor. Up until now, debugging has been the primary motivation used by previous work. We argue that stronger usage models are required to justify the cost of adding hardware support for R&R. Hence, the first contribution of this paper is a set of additional usage models (the virtues of R&R) and how we think they will add value to future microprocessors. We think the current lack of focus on wider and more applicable usage models is an obstacle to adoption. The other obstacle to implementing hardware-assisted R&R is that some complexities of such an endeavor have not been addressed and studied properly. Current research approaches have shortcomings that prevent them from becoming an implementable feature. In this paper, we also identify those shortcomings and indicate research directions to bridge the gap between research and a product implementation.

[2009]

Gilles Pokam, Cristiano Pereira, Klaus Danne, Lynda Yang, Sam King, and Josep Torrellas. 

In Intel Technology Journal (ITJ), Volume 13, Issue 4, Dec. 2009.

Abstract

As multi-processors become mainstream, software developers must harness the parallelism available in programs to keep up with multi-core performance. Writing parallel programs, however, is notoriously difficult, even for the most advanced programmers. The main reason for this lies in the non-deterministic nature of concurrent programs, which makes it very difficult to reproduce a program execution. As a result, reasoning about program behavior is challenging. For instance, debugging concurrent programs is known to be difficult because of the non-determinism of multi-threaded programs. Malicious code can hide behind non-determinism, making software vulnerabilities much more difficult to detect in multi-threaded programs. 

In this article, we explore hardware and software avenues for improving the programmability of Intel® multi-processors. In particular, we investigate techniques for reproducing a non-deterministic program execution that can efficiently deal with the issues just mentioned. We identify the main challenges associated with these techniques, examine opportunities to overcome some of these challenges, and explore potential usage models of program execution reproducibility for debugging and fault tolerance of concurrent programs. 

Gilles Pokam, Cristiano Pereira, Klaus Danne, Rolf Kassa, and Ali-Reza Adl-Tabatabai. 

In Proceedings of the 42nd International Symposium on Microarchitecture (MICRO-42), New York City, NY, USA, Dec. 2009.

Abstract

Prior work on HW support for memory race recording piggybacks time stamps on coherence messages and logs the outcome of memory races using point-to-point or chunk-based approaches. These memory race recorder (MRR) techniques are effective, but they require modifications to the cache coherence protocol that can hurt performance. In addition, prior work has mostly focused on directory coherence and considered only CMP systems with single-level cache hierarchies. Most modern CMP systems shipped today, however, implement snoop coherence and feature multilevel cache hierarchies. To be practical, an MRR must target CMPs with multilevel caches, mitigate the coherence overhead due to piggybacking, and emphasize replay speed to broaden the applicability of deterministic replay. 

This paper contributes three new solutions for making chunk-based MRR practical for modern CMPs. We show that MRR interactions with a cache hierarchy can degrade performance and present a novel mechanism that mitigates this degradation. We propose new mechanisms for snoop-based caches that eliminate the coherence traffic overhead due to piggybacking. Finally, we propose new techniques for improving replay speed and introduce a novel framework for evaluating the replay speed potential of MRR designs. 

Klaus Danne, Gilles Pokam and Cristiano Pereira. 

Gesellschaft für Informatik (GI 2009), Operating Systems Group (Fachgruppe Betriebssysteme), Fall Workshop (Klausurtagung Herbst), Nov. 2009.

Armin Heindl and Gilles Pokam. 

In 16th International Conference on Analytical and Stochastic Modeling Techniques and Applications (ASMTA 2009), Madrid, Spain, June 2009.

Abstract

We extend an existing analytic framework for modeling software transactional memory (STM) to an optimistic STM variant in which write locks are acquired lazily. Lazy locking requires a different calculation of the transition probabilities of the underlying discrete-time Markov chain (DTMC).

Based on a few relevant input parameters, such as the number of concurrent transactions, the transaction lengths, the share of writing operations, and the number of accessible transactional data objects, a fixed-point iteration over closed-form analytic expressions delivers key STM performance measures, e.g., the mean number of transaction restarts and the mean number of processed steps of a transaction.

In particular, the analytic model helps to predict STM performance trends as the number of cores on multi-processors increases; other performance measures provide additional insight into system behavior.
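
The paper's closed-form expressions are specific to its DTMC and are not reproduced here; purely to illustrate the solution style, the Python sketch below runs a generic fixed-point iteration in which the update function is a made-up placeholder rather than the paper's model.

    # Generic fixed-point iteration of the kind used to solve closed-form STM
    # expressions. mean_restarts() is a made-up placeholder, NOT the paper's model.
    def fixed_point(update, x0, tol=1e-9, max_iter=10_000):
        x = x0
        for _ in range(max_iter):
            x_new = update(x)
            if abs(x_new - x) < tol:
                return x_new
            x = x_new
        raise RuntimeError("fixed-point iteration did not converge")

    def mean_restarts(r, n_txn=8, txn_len=20, p_write=0.3, n_objects=1000):
        # Illustrative only: the conflict probability grows with the work redone
        # after restarts, and the mean restart count depends on it in turn.
        p_conflict = 1.0 - (1.0 - p_write * txn_len * (1.0 + r) / n_objects) ** (n_txn - 1)
        return p_conflict / (1.0 - p_conflict)

    print(round(fixed_point(mean_restarts, 0.0), 4))  # converged mean restart count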

Armin Heindl and Gilles Pokam. 

In Computer Networks, Volume 53, Issue 8, p 1155-1302, June 2009.

Abstract

Analytic models based on discrete-time Markov chains (DTMC) are proposed to assess the algorithmic performance of Software Transactional Memory (STM) systems. Basic STM variants are compared: pessimistic STM, and optimistic STM with either in-place memory updates or write buffering. Starting from an absorbing DTMC, closed-form analytic expressions are developed, which are quickly solved iteratively to determine key parameters of the considered STM systems, such as the mean number of transaction restarts and the mean transaction length. Since the models reflect complex transactional behavior in terms of read/write locking, data consistency checks and conflict management independent of implementation details, they highlight the algorithmic performance advantages of one system over the other, which, due to their at times small differences, are often blurred by the implementation of STM systems and are difficult to discern even with statistically significant discrete-event simulations. 

Carl Svensson, David Kesler, Rakesh Kumar and Gilles Pokam. 

UIUC Technical Report UILU-09-2209, Apr. 2009.

Armin Heindl, Gilles Pokam and Ali-Reza Adl-Tabatabai. 

In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), Boston, Massachusetts, USA, April 2009.

Abstract

An analytic model is proposed to assess the performance of optimistic Software Transactional Memory (STM) systems with in-place memory updates for write operations. Based on an absorbing discrete-time Markov chain, closed-form analytic expressions are developed, which are quickly solved iteratively to determine key parameters of the STM system. The model covers complex implementation details such as read/write locking, data consistency checks and conflict management. It provides fundamental insight into system behavior when input parameters such as the number and size of concurrent transactions or the number of data objects are varied. Numerical results are validated by comparison with a discrete-event simulation. 

Armin Heindl and Gilles Pokam. 

In Proceedings of the Second International Conference on Simulation Tools and Techniques (SIMUTools 2009), Rome, Italy, March 2009.

Abstract

A flexible simulation model is presented to study different variants of software transactional memory (STM), such as pessimistic STM, or optimistic STM with either in-place memory updates or write buffering. The dynamic behavior of transactions is encoded in timed statecharts as provided by the simulation tool AnyLogic in its implementation of real-time UML. Their graphical representation helps to convey the key design issues of the simulation model within this publication. Statistically significant numeric results for varying parameters, such as the number of threads, the number of transactional operations, and the number of transactional data objects, are obtained efficiently as part of a Parameter Variation Experiment. 

[2008]

Michael Konow, Ali-Reza Adl-Tabatabai, Gilles Pokam, Jesse Barnes and Rolf Kassa. 

In Intel Design and Test Technology Conference (DTTC 2008), Portland, Oregon, USA, Aug. 2008.

[2006]

Weihaw Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck, Gilles Pokam, Osvaldo Colavin and Brad Calder. 

In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII), San Jose, CA, USA, Oct 2006.

Abstract

Exploiting thread level parallelism is paramount in the multi-core era. Transactions enable programmers to expose such parallelism by greatly simplifying the multi-threaded programming model. Virtualized transactions (unbounded in space and time) are desirable, as they can increase the scope of transactions’ use, and thereby further simplify a programmer’s job. However, hardware support is essential to support efficient execution of unbounded transactions. In this paper, we introduce Page-based Transactional Memory to support unbounded transactions. We combine transaction bookkeeping with the virtual memory system to support fast transaction conflict detection, commit, abort, and to maintain transactions’ speculative data. 

Olivier Rochecouste, Gilles Pokam and Andre Seznec. 

ACM Transactions on Architecture and Code Optimization (ACM-TACO), Vol 3, No. 3, September 2006, Pages 295-326.

Abstract

The analysis of program executions reveals that most integer and multimedia applications make heavy use of narrow-width operations, i.e., instructions exclusively using narrow-width operands and producing a narrow-width result. Moreover, this usage is relatively well distributed over the application. We observed this program property on the MediaBench and SPEC2000 benchmarks, with about 40% of the instructions being narrow-width operations. Current superscalar processors use 64-bit datapaths to execute all the instructions of an application. In this paper, we suggest the use of a width-partitioned microarchitecture (WPM) to master the hardware complexity of a superscalar processor. For a four-way issue machine, we split the processor into two two-way clusters: the main cluster executing 64-bit operations, load/store, and complex operations, and a narrow cluster executing the 16-bit operations. We resort to partitioning to decouple the treatment of the narrow-width operations from that of the other program instructions. This provides the benefit of greatly simplifying the design of the critical processor components in each cluster (e.g., the register file and the bypass network). The dynamic interleaving of the two instruction types keeps the workload balanced among the clusters. WPM also helps to reduce the complexity of the interconnection fabric and of the issue logic. In fact, since the 16-bit cluster can only communicate narrow-width data, the datapath width of the interconnect fabric can be significantly reduced, yielding corresponding savings in interconnect power and area. We explore different possible configurations of WPM, discussing the various implementation tradeoffs. We also examine a speculative steering heuristic to distribute the narrow-width operations among clusters. A detailed analysis of the complexity factors shows that using WPM instead of a classical 64-bit two-cluster microarchitecture can save power and silicon area with a minimal impact on overall performance. 
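
As a back-of-the-envelope illustration of the property being measured (operands and result all fitting in 16 bits), the Python sketch below classifies individual operations as narrow-width; it mimics the definition used above and is not the steering or detection hardware.

    # Illustrative narrow-width classifier: an operation is "narrow" when all
    # source operands and the result fit in 16 signed bits. Not the hardware.
    def fits_in_16_bits(value):
        return -(1 << 15) <= value < (1 << 15)

    def is_narrow_op(operands, result):
        return all(fits_in_16_bits(v) for v in operands) and fits_in_16_bits(result)

    ops = [((3, 12), 36), ((70000, 2), 140000), ((-4, 100), -400)]
    narrow = [is_narrow_op(srcs, res) for (srcs, res) in ops]
    print(narrow)                                                  # [True, False, True]
    print(f"{100 * sum(narrow) / len(narrow):.0f}% narrow-width")  # 67% narrow-width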

Satish Narayanasamy, Gilles Pokam and Brad Calder. 

IEEE Micro Special Issue: Top Picks from Computer Architecture Conferences (Micro's Top Picks), Jan/Feb 2006.

Abstract

With software’s increasing complexity, providing efficient hardware support for software debugging is critical. Hardware support is necessary to observe and capture, with little or no overhead, the exact execution of a program. Providing this ability to developers will allow them to deterministically replay and debug an application to pinpoint the root cause of a bug. 

[2005]

Satish Narayanasamy, Gilles Pokam and Brad Calder. 

In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), Madison, Wisconsin, USA, June 2005.

Abstract

Significant time is spent by companies trying to reproduce and fix bugs that occur in released code. To assist developers, we propose the BugNet architecture to continuously record information on production runs. The information collected before the crash of a program can be used by developers working in their own execution environment to deterministically replay the last several million instructions executed before the crash. 

BugNet is based on the insight that recording the register file contents at any point in time, and then recording the load values that occur after that point, can enable deterministic replaying of a program’s execution. BugNet focuses on being able to replay the application’s execution and the libraries it uses, but not the operating system. However, our approach still provides the ability to replay an application’s execution across context switches and interrupts. Hence, BugNet obviates the need for tracking program I/O, interrupts and DMA transfers, which would have otherwise required more complex hardware support. In addition, BugNet does not require a final core dump of the system state for replaying, which significantly reduces the amount of data that must be sent back to the developer. 
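
The core insight lends itself to a small thought experiment: given a register checkpoint and a log of the values returned by loads, user-level execution can be re-run deterministically without tracking I/O, interrupts or DMA. The Python sketch below replays a toy three-instruction ISA; it is an illustration, not BugNet's hardware or log format.

    # Toy illustration of the BugNet insight: a register checkpoint plus a log
    # of load values is enough to replay user-level execution deterministically.
    def replay(checkpoint_regs, load_log, program):
        regs = dict(checkpoint_regs)   # register file contents at the checkpoint
        loads = iter(load_log)         # values observed by loads while recording
        for op, *args in program:
            if op == "load":           # memory, I/O and DMA effects come from the log
                (dst,) = args
                regs[dst] = next(loads)
            elif op == "add":
                dst, a, b = args
                regs[dst] = regs[a] + regs[b]
            elif op == "store":        # stores need no log entry to replay
                pass
        return regs

    program = [("load", "r1"), ("load", "r2"), ("add", "r3", "r1", "r2"), ("store", "r3")]
    print(replay({"r1": 0, "r2": 0, "r3": 0}, load_log=[5, 7], program=program))
    # -> {'r1': 5, 'r2': 7, 'r3': 12}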

[2004]

Gilles Pokam and Francois Bodin. 

In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2004), West Lafayette, Indiana, USA, September 2004.

Abstract

Software optimization techniques are highly reliant on program behavior to deliver high performance. A key element of these techniques is to identify the program paths that are likely to achieve the greatest performance benefits at runtime. Several approaches have been proposed to address this problem. However, many of them fail to cover a larger optimization scope, as they are restricted to loops or procedures. This paper introduces a novel approach for representing and analyzing complete program paths. Unlike the whole-program paths (WPPs) approach, which relies on a DAG to represent program paths, our program trace is processed into a suffix array that enables very fast search algorithms running in O(ln(N)) time, where N is the length of the trace. This allows reasonable trace sizes to be processed offline, avoiding the high runtime overhead incurred by WPPs, while accurately characterizing hot paths. Our evaluation shows impressive performance results, with almost 48% of the code being covered by hot paths. We also demonstrate the effectiveness of our approach to optimize for power. For this purpose, an adaptive cache resizing scheme is used that shows energy savings on the order of 12%. 
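
To make the data-structure choice concrete, the Python sketch below builds a suffix array over a toy trace of basic-block identifiers and binary-searches it for a path; the construction here is the naive one for clarity, whereas a practical implementation over long traces would use a faster construction.

    # Minimal suffix-array search over a program trace (toy sizes; naive build).
    def build_suffix_array(trace):
        # trace: a sequence of comparable path elements (e.g., basic-block ids)
        return sorted(range(len(trace)), key=lambda i: trace[i:])

    def find_path(trace, sa, path):
        # Binary search for the block of suffixes that start with `path`.
        lo, hi = 0, len(sa)
        while lo < hi:                  # lower bound: first suffix prefix >= path
            mid = (lo + hi) // 2
            if trace[sa[mid]:sa[mid] + len(path)] < path:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, len(sa)
        while lo < hi:                  # upper bound: first suffix prefix > path
            mid = (lo + hi) // 2
            if trace[sa[mid]:sa[mid] + len(path)] <= path:
                lo = mid + 1
            else:
                hi = mid
        return [sa[i] for i in range(start, lo)]  # positions where `path` occurs

    trace = tuple("ABCABDABCAB")                  # toy trace of basic-block ids
    sa = build_suffix_array(trace)
    print(sorted(find_path(trace, sa, tuple("ABC"))))  # [0, 6]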

Gilles Pokam and Francois Bodin. 

In Proceedings of the 11th International Workshop on Compilers for Parallel Computers (CPC 2004), Chiemsee, Germany, July 2004.

Gilles Pokam, Olivier Rochecouste, Andre Seznec and Francois Bodin. 

In Proceedings of the ACM SIGPLAN/SIGBED 2004 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2004), Washington, DC, USA, June 2004.

Abstract

This paper evaluates managing the processor’s datapath-width at the compiler level by means of exploiting dynamic narrow-width operands. We capitalize on the large occurrence of these operands in multimedia programs to build static narrow-width regions that may be directly exposed to the compiler. We propose to augment the ISA with instructions directly exposing the datapath and the register widths to the compiler. Simple exception management allows this exposition to be only speculative. In this way, we permit the software to speculatively accommodate the execution of a program on a narrower datapath-width in order to save energy. For this purpose, we introduce a novel register file organization, the byte-slice register file, which allows the width of the register file to be dynamically reconfigured, providing both static and dynamic energy savings. We show that by combining the advantages of the byte-slice register file with the advantages provided by clock-gating the datapath on a per-region basis, up to 17% of the datapath dynamic energy can be saved, while a 22% reduction of the register file static energy is achieved. 

Gilles Pokam, Stephan Bihan, Julien Simonnet, and Francois Bodin. 

Concurrency and Computation: Practice and Experience (CCPE), Volume 16, Issue 2-3, p 303 - 318, February - March 2004.

Abstract

In this paper, we propose SWARP, a retargetable preprocessor for exploiting multimedia instructions. The system mixes loop distribution, unrolling and pattern matching to exploit complex multimedia instructions. Contrary to all available systems, it can be extended at the user level. Using a TriMedia processor, we show that our system achieves good code quality on a set of frequently used loop kernels for multimedia applications. 

Gilles Pokam and Francois Bodin. 

In Proceedings of the 8th IEEE Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT-8), Madrid, Spain, February 2004.

Abstract

Managing the energy-performance tradeoff has become a major challenge with embedded systems. The cache hierarchy is a typical example where this tradeoff plays a central role. With the increasing level of integration density, a cache can feature millions of transistors, consuming a significant portion of the energy. At the same time, however, a cache also permits significant performance improvements. Configurable caches are becoming the "de-facto" solution to deal efficiently with these issues. Such caches are equipped with mechanisms that enable them to be resized dynamically. With regard to embedded systems, however, many of these mechanisms restrict configurability to the application level. We propose in this paper to modify the structure of a configurable cache to offer embedded compilers the opportunity to reconfigure it according to a program's dynamic phases, rather than on a per-application basis. We show in our experimental results that the proposed scheme has the potential to improve the compiler's effectiveness at reducing energy consumption, while not excessively degrading performance. 

[2001]

Gilles Pokam, Julien Simonnet and Francois Bodin. 

In Proceedings of the 9th International Workshop on Compilers for Parallel Computers (CPC 2001), Edinburgh, Scotland, UK, June 2001.

Abstract

The demand for more computation power in the media computing domain has led to multimedia extensions in the instruction sets of modern processors. This has resulted in the implementation of new instructions that handle short data types and exploit subword parallelism to improve performance. Compiler developments that aim to provide some degree of support for these instructions are now emerging and usually fall into three categories: vectorization, idiom recognition and code generation. Unfortunately, there is no consensus among vendors on the definition of multimedia instructions. Therefore, very few works have addressed the general issue of exploiting multimedia instructions from high-level languages. 

In this paper, we present a prototype preprocessor for exploiting multimedia instructions in "C" code. Our preprocessor relies on idiom recognition techniques for detecting potential usages of multimedia instructions and exploits a rule-based graph abstraction for the rewriting process. 
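
To give a flavor of idiom recognition (the actual preprocessor works on "C" code with a rule-based graph abstraction, and the intrinsic name below is made up), the Python sketch rewrites a clamped-add pattern min(a + b, 255) in a tiny expression tree into a saturating-add operation.

    # Toy idiom recognizer: rewrite min(a + b, 255) into a hypothetical
    # saturating-add intrinsic inside a nested-tuple expression tree.
    def rewrite(node):
        if isinstance(node, tuple):
            node = tuple(rewrite(child) for child in node)
            if (len(node) == 3 and node[0] == "min"
                    and isinstance(node[1], tuple) and node[1][0] == "add"
                    and node[2] == ("const", 255)):
                return ("saturating_add_u8", node[1][1], node[1][2])
        return node

    expr = ("min", ("add", ("var", "a"), ("var", "b")), ("const", 255))
    print(rewrite(expr))  # ('saturating_add_u8', ('var', 'a'), ('var', 'b'))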

Augusto Froehlich, Gilles Pokam and Wolfgang Schroeder-Preikschat.

In Proceedings of the 8th International Conference on High Performance Computing and Networking Europe (HPCN Europe 2000), Amsterdam, The Netherlands, May 2000.

Thesis

Gilles Pokam. 

PhD Thesis, #2995, University of Rennes 1, France, July 2004.

Gilles Pokam. 

Diploma Thesis, Technical University of Berlin, Dec 1999.