Past Projects

PAST PROJECTS

  • Secure Execution Environment
  • Architectural Support for Software Reliability
  • Intelligent Memory Hierarchy Design for Performance and Quality of Service
  • Analytical Performance Modeling of Memory Hierarchy
  • Helper Computing

Secure Execution Environment

ObfusMem: A Low-Overhead Access Obfuscation for Trusted Memories (ISCA 2017)

Trustworthy software requires strong privacy and security guarantees from a secure trust base in hardware. While chipmakers provide hardware support for basic security and privacy primitives such as enclaves and memory encryption, these primitives do not address hiding of the memory access pattern, information about which may enable attacks on the system or reveal characteristics of sensitive user data. State-of-the-art approaches to protecting the access pattern are largely based on Oblivious RAM (ORAM). Unfortunately, current ORAM implementations suffer from very significant practicality and overhead concerns, including roughly an order of magnitude slowdown, more than 100% memory capacity overheads, and the potential for system deadlock.

Memory technology trends are moving towards 3D and 2.5D integration, enabling significant logic capabilities and sophisticated memory interfaces. Leveraging these trends, we propose a new approach to access pattern obfuscation, called ObfusMem. ObfusMem adds the memory to the trusted computing base and incorporates cryptographic engines within the memory. ObfusMem encrypts commands and addresses on the memory bus, so the access pattern is cryptographically obfuscated from external observers. Our evaluation shows that ObfusMem incurs an overhead of 10.9% on average, which is about an order of magnitude faster than ORAM implementations. Furthermore, ObfusMem does not incur capacity overheads and does not amplify writes. We analyze and compare the security protections provided by ObfusMem and ORAM, and highlight their differences.
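
The gist of the bus-side protection can be illustrated in a few lines. The sketch below is a software analogy, not the paper's hardware design: it assumes a key shared between the processor-side controller and the trusted memory, and uses an HMAC-derived pad as a stand-in for the hardware AES counter-mode engines, so that command, address, and data all cross the bus as ciphertext.

  # Minimal sketch of bus-level access obfuscation (not ObfusMem's hardware design):
  # CPU-side controller and trusted memory share a key; each bus transfer is
  # encrypted with a per-transfer counter so command, address, and data appear
  # as ciphertext to a bus snooper. HMAC-SHA256 stands in for an AES engine.
  import hmac, hashlib, struct

  KEY = b"shared-secret-between-cpu-and-memory"

  def pad(counter: int, nbytes: int) -> bytes:
      """Keyed pseudo-random pad derived from the transfer counter."""
      return hmac.new(KEY, struct.pack(">Q", counter), hashlib.sha256).digest()[:nbytes]

  def xor(a: bytes, b: bytes) -> bytes:
      return bytes(x ^ y for x, y in zip(a, b))

  def encrypt_request(counter, is_write, addr, data=b"\0" * 8):
      plaintext = struct.pack(">BQ", int(is_write), addr) + data
      return xor(plaintext, pad(counter, len(plaintext)))

  def decrypt_request(counter, wire):
      plaintext = xor(wire, pad(counter, len(wire)))
      is_write, addr = struct.unpack(">BQ", plaintext[:9])
      return bool(is_write), addr, plaintext[9:]

  # The bus observer sees only `wire`; the trusted memory recovers the command.
  wire = encrypt_request(counter=7, is_write=True, addr=0x1000, data=b"ABCDEFGH")
  print(decrypt_request(7, wire))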

Silent Shredder: Zero-Cost Shredding for Secure Non-Volatile Main Memory Controllers (ASPLOS 2016)

As non-volatile memory (NVM) technologies are expected to replace DRAM in the near future, new challenges and design constraints should be considered when architecting NVM-based systems. For example, NVMs have slow and power-consuming writes, and limited write endurance. Thus, reducing the number of writes is highly desirable. Similarly, NVMs have a data remanence vulnerability, i.e., they retain data for a long time after being powered off. NVM encryption alleviates the vulnerability, but exacerbates limited endurance by increasing the number of writes to memory. In this paper, we propose an approach to reduce the number of writes to encrypted NVMs. We observe that in current systems a large percentage of all main memory writes can result from data shredding in operating systems, which is the process of zeroing out physical pages before mapping them to new processes, in order to protect previous processes' data. Our Non-Volatile Main Memory controller, Silent Shredder, repurposes initialization vectors used in standard counter mode encryption to completely eliminate the writes occurring due to data shredding. Furthermore, it speeds up reading shredded cache lines, and hence reduces power consumption and improves overall performance. We discuss several use cases, including virtual machines' data isolation and user-level large data initialization, where Silent Shredder can be used effectively at no extra cost. To evaluate our design, we use gem5, a detailed full-system simulator, to run 3 graph analytics applications from the PowerGraph framework and 26 multi-programmed workloads from the SPEC 2006 benchmark suite. Silent Shredder eliminates an average of 48.6% of the writes in the initialization and graph construction phases. Furthermore, it speeds up main memory reads by 3.3 times on average, and improves the number of instructions per cycle (IPC) by 6.4% on average.
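
The counter-mode observation behind Silent Shredder can be modeled compactly. The sketch below is a simplified software model, not the actual controller: page initialization vectors and per-line counters are plain dictionaries, a keyed hash stands in for AES, and a line whose counter is still at its reset value is defined to read as zeros, so shredding a page updates only metadata.

  # Simplified model of the Silent Shredder idea (assumptions, not the actual
  # controller): with counter-mode encryption each line is stored as
  # data XOR pad(key, line_address, IV). If the controller bumps a page's IV
  # and treats lines whose per-line counter is still at its reset value as
  # all-zero, zeroing a page requires no writes to the NVM array.
  import hmac, hashlib, struct

  KEY = b"memory-encryption-key"
  LINE = 64

  def pad(addr, iv, ctr):
      seed = struct.pack(">QQQ", addr, iv, ctr)
      out = b""
      while len(out) < LINE:
          out += hmac.new(KEY, seed + struct.pack(">I", len(out)), hashlib.sha256).digest()
      return out[:LINE]

  class Controller:
      def __init__(self):
          self.nvm = {}        # addr -> ciphertext actually stored in NVM
          self.page_iv = {}    # page -> current IV
          self.line_ctr = {}   # addr -> per-line write counter
          self.writes = 0

      def write(self, addr, data):
          page = addr // 4096
          iv = self.page_iv.setdefault(page, 0)
          ctr = self.line_ctr.get(addr, 0) + 1
          self.line_ctr[addr] = ctr
          self.nvm[addr] = bytes(a ^ b for a, b in zip(data, pad(addr, iv, ctr)))
          self.writes += 1

      def read(self, addr):
          page = addr // 4096
          iv = self.page_iv.get(page, 0)
          ctr = self.line_ctr.get(addr, 0)
          if ctr == 0:                       # line untouched since last shred
              return bytes(LINE)             # defined to decrypt to zeros
          return bytes(a ^ b for a, b in zip(self.nvm[addr], pad(addr, iv, ctr)))

      def shred_page(self, page):
          """Zero a page with metadata updates only, with no NVM data writes."""
          self.page_iv[page] = self.page_iv.get(page, 0) + 1
          for addr in range(page * 4096, (page + 1) * 4096, LINE):
              self.line_ctr[addr] = 0

  c = Controller()
  c.write(0x0, b"secret".ljust(LINE, b"\0"))
  c.shred_page(0)
  assert c.read(0x0) == bytes(LINE) and c.writes == 1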

i-NVMM: A Secure Non-Volatile Main Memory System with Incremental Encryption (ISCA 2011)

Emerging technologies for building non-volatile main memory (NVMM) systems suffer from a security vulnerability where information lingers on long after the system is powered down, enabling an attacker with physical access to the system to extract sensitive information off the memory. The goal of this study is to find a solution for such a security vulnerability. We introduce i-NVMM, a data privacy protection scheme for NVMM, where the main memory is encrypted incrementally, i.e., different data in the main memory is encrypted at different times depending on whether the data is predicted to still be useful to the processor. The motivation behind incremental encryption is the observation that the working set of an application is much smaller than its resident set. By identifying the working set and encrypting the remaining part of the resident set, i-NVMM can keep the majority of the main memory encrypted at all times without penalizing performance by much. Our experiments demonstrate promising results. i-NVMM keeps 78% of the main memory encrypted across SPEC2006 benchmarks, yet only incurs 3.7% execution time overhead, and has a negligible impact on the write endurance of NVMM, all achieved with relatively simple hardware support in the memory module.
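
To make the incremental idea concrete, the sketch below is a deliberately simplified software model, not the paper's in-module hardware: pages are decrypted on demand, and a periodic background sweep re-encrypts any page that has stayed outside the recent working set. The idle threshold and sweep granularity are illustrative choices.

  # Illustrative sketch of incremental encryption (a simplification of i-NVMM,
  # not its memory-module hardware): pages outside the recent working set are
  # kept encrypted; a page is decrypted on access and re-encrypted by a
  # periodic sweep once it has been idle long enough.
  class IncrementalEncryptor:
      def __init__(self, idle_threshold=3):
          self.encrypted = {}      # page -> True if currently ciphertext
          self.idle = {}           # page -> sweeps since last access
          self.idle_threshold = idle_threshold

      def access(self, page):
          if self.encrypted.get(page, True):
              self.encrypted[page] = False   # decrypt on demand (latency paid here)
          self.idle[page] = 0

      def sweep(self):
          """Background pass: re-encrypt pages that left the working set."""
          for page, idle in self.idle.items():
              if not self.encrypted.get(page, True) and idle >= self.idle_threshold:
                  self.encrypted[page] = True
              self.idle[page] = idle + 1

      def fraction_encrypted(self):
          pages = list(self.idle) or [0]
          return sum(self.encrypted.get(p, True) for p in pages) / len(pages)

  e = IncrementalEncryptor()
  for step in range(10):
      for p in (1, 2):            # small working set stays decrypted
          e.access(p)
      e.access(step + 100)        # cold pages get re-encrypted over time
      e.sweep()
  print(f"{e.fraction_encrypted():.0%} of touched pages encrypted")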

Comprehensively and Efficiently Protecting the Heap (ASPLOS 2006)

The goal of this paper is to propose a scheme that provides comprehensive security protection for the heap. Heap vulnerabilities are increasingly being exploited for attacks on computer programs. In most implementations, the heap management library keeps the heap meta-data (heap structure information) and the application's heap data in an interleaved fashion and does not protect them against each other. Such implementations are inherently unsafe: vulnerabilities in the application can cause the heap library to perform unintended actions to achieve control-flow and non-control attacks.

Unfortunately, current heap protection techniques are limited in that they make too many assumptions about how attacks will be performed, require new hardware support, or require too many changes to the software developers' toolchain. We propose Heap Server, a new solution that does not have such drawbacks. Through existing virtual memory and inter-process protection mechanisms, Heap Server prevents the heap meta-data from being illegally overwritten, and heap data from being meaningfully overwritten. We show that through aggressive optimizations and parallelism, Heap Server protects the heap with nearly-negligible performance overheads even on heap-intensive applications. We also verify the protection against several real-world exploits and attack kernels.

Full paper: pdf
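
The separation principle can be shown with a toy allocator. The sketch below is only an analogy of the design: a hypothetical MetadataServer object stands in for the separate, protected Heap Server process, and all chunk bookkeeping lives there rather than interleaved with application heap data, so heap overflows cannot reach it and bogus frees are caught.

  # Toy illustration of the Heap Server separation principle (not the actual
  # implementation, which uses a separate process and inter-process protection):
  # allocator metadata is kept outside the application-visible heap, so an
  # overflow in heap data cannot reach chunk headers or free-list pointers.
  class MetadataServer:
      """Stands in for the separate, protected Heap Server process."""
      def __init__(self):
          self.chunks = {}          # addr -> size (kept out of the app heap)
          self.next_addr = 0x1000

      def allocate(self, size):
          addr = self.next_addr
          self.next_addr += size
          self.chunks[addr] = size
          return addr

      def free(self, addr):
          # Metadata is validated here; a corrupted app heap cannot forge it.
          if addr not in self.chunks:
              raise ValueError("invalid free detected by metadata server")
          del self.chunks[addr]

  class AppHeap:
      """Application-visible heap: data only, no interleaved metadata."""
      def __init__(self, server):
          self.server = server
          self.data = {}

      def malloc(self, size):
          addr = self.server.allocate(size)
          self.data[addr] = bytearray(size)
          return addr

      def free(self, addr):
          self.server.free(addr)
          self.data.pop(addr, None)

  heap = AppHeap(MetadataServer())
  a = heap.malloc(32)
  heap.free(a)
  try:
      heap.free(a)                 # double free is caught by the metadata server
  except ValueError as err:
      print(err)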

Efficient Data Protection for Distributed Shared Memory Multiprocessors (PACT 2006)

Data security in computer systems has recently become an increasing concern, and hardware-based attacks have emerged. As a result, researchers have investigated hardware encryption and authentication mechanisms as a means of addressing this security concern. Unfortunately, no such techniques have been investigated for Distributed Shared Memory (DSM) multiprocessors, and previously proposed techniques for uni-processor and Symmetric Multiprocessor (SMP) systems cannot be directly used for DSMs. This work is the first to examine the issues involved in protecting secrecy and integrity of data in DSM systems. We first derive security requirements for processor-processor communication in DSMs, and find that different types of coherence messages need different protection. Then we propose and evaluate techniques to provide efficient encryption and authentication of the data in DSM systems. Our simulation results using SPLASH-2 benchmarks show that the execution time overhead for our three proposed approaches is small and ranges from 6% to 8% on a 16-processor DSM system, relative to a similar DSM without support for data secrecy and integrity.

Full paper: pdf

Improving Cost, Performance, and Security of Memory Encryption and Authentication (ISCA 2006)

Protection from hardware attacks such as snoopers and mod chips has been receiving increasing attention in computer architecture. This paper presents a new combined memory encryption/authentication scheme. Our new split counters for counter-mode encryption simultaneously eliminate counter overflow problems and reduce per-block counter size, and we also dramatically improve authentication performance and security by using the Galois/Counter Mode of operation (GCM), which leverages counter-mode encryption to reduce authentication latency and overlap it with memory accesses. Our results indicate that the split-counter scheme has a negligible overhead even with a small (32KB) counter cache and using only eight counter bits per data block. The combined encryption/authentication scheme has an IPC overhead of 5% on average across SPEC CPU 2000 benchmarks, which is a significant improvement over the 20% overhead of existing encryption/authentication schemes.

Full paper: pdf
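
A rough model of the split-counter layout follows. It is a sketch under simplifying assumptions, not the hardware: a keyed hash replaces AES, dictionaries replace the counter cache, and the minor counter width is a parameter. The point is that the encryption seed combines the block address, a shared per-page major counter, and a small per-block minor counter, with a minor-counter overflow forcing a page re-encryption instead of a whole-memory re-encryption.

  # Sketch of split-counter counter-mode encryption (assumptions: keyed-hash pad
  # instead of AES, dictionaries instead of a counter cache). Each page keeps one
  # large major counter; each 64B block keeps a small minor counter. The pad seed
  # is (block address, major, minor); when a minor counter overflows, the major
  # counter is bumped and the page's blocks are re-encrypted.
  import hmac, hashlib, struct

  KEY = b"key"
  MINOR_BITS = 8                    # e.g. only eight counter bits per data block

  def pad(addr, major, minor, n=64):
      seed = struct.pack(">QQQ", addr, major, minor)
      return (hmac.new(KEY, seed, hashlib.sha512).digest() * 2)[:n]

  class SplitCounters:
      def __init__(self):
          self.major = {}           # page -> major counter
          self.minor = {}           # block addr -> minor counter

      def next_seed(self, addr):
          """Advance counters for a write to `addr`; may trigger page re-encryption."""
          page = addr // 4096
          major = self.major.setdefault(page, 0)
          minor = self.minor.get(addr, 0) + 1
          if minor >= (1 << MINOR_BITS):        # minor overflow
              major += 1                        # bump shared major counter
              self.major[page] = major
              for blk in range(page * 4096, (page + 1) * 4096, 64):
                  self.minor[blk] = 0           # page blocks must be re-encrypted
              minor = 1
          self.minor[addr] = minor
          return addr, major, minor

  ctrs = SplitCounters()
  addr, major, minor = ctrs.next_seed(0x2000)
  line = bytes(64)
  cipher = bytes(a ^ b for a, b in zip(line, pad(addr, major, minor)))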

Architectural Support for Software Reliability

MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging (HPCA 2007)

Memory bugs are a broad class of bugs that is becoming increasingly common with increasing software complexity, and many of these bugs also create security vulnerabilities. Unfortunately, existing software and even hardware approaches for finding and identifying these bugs have considerable performance overheads, target only a narrow class of bugs, are costly to implement, or use computational resources inefficiently.

This paper describes MemTracker, a new hardware support mechanism that can be configured to perform different kinds of memory access monitoring tasks. MemTracker associates each word of data in memory with a few bits of state, and uses a programmable state transition table to react to different events that can affect this state. The number of state bits per word, the events to which MemTracker reacts, and the transition table are all fully programmable. The rich set of states, events, and transitions allowed by MemTracker supports different monitoring and debugging checkers with minimal performance overheads, including checkers which involve frequent state updates. To evaluate our MemTracker support, we map three different checkers onto it, as well as a checker that combines all three. For the most demanding (combined) checker, we observe performance overheads of only 2.7% on average and 4.8% worst-case on SPEC 2000 applications. Such low overheads allow continuous (always-on) use of MemTracker-based checkers even in production runs.
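
The programmable-table idea translates directly into a small software model. The sketch below is not the hardware, and the state and event names are made up for illustration: a few state values per word plus a transition table indexed by (event, state) are enough to express, for example, an uninitialized-read checker.

  # Minimal sketch of a MemTracker-style programmable checker (a software model,
  # not the hardware): each memory word carries a small state, and a transition
  # table indexed by (event, state) gives the next state and whether to trap.
  # States/events below encode a simple allocation/initialization checker.
  UNALLOC, ALLOC_UNINIT, INIT = 0, 1, 2          # per-word states
  ALLOCATE, FREE, WRITE, READ = "alloc", "free", "write", "read"

  # (event, state) -> (next_state, raise_error)
  TABLE = {
      (ALLOCATE, UNALLOC):      (ALLOC_UNINIT, False),
      (WRITE,    ALLOC_UNINIT): (INIT,         False),
      (WRITE,    INIT):         (INIT,         False),
      (READ,     INIT):         (INIT,         False),
      (READ,     ALLOC_UNINIT): (ALLOC_UNINIT, True),   # read of uninitialized word
      (READ,     UNALLOC):      (UNALLOC,      True),   # read of unallocated word
      (WRITE,    UNALLOC):      (UNALLOC,      True),   # write to unallocated word
      (FREE,     ALLOC_UNINIT): (UNALLOC,      False),
      (FREE,     INIT):         (UNALLOC,      False),
  }

  state = {}                                     # word address -> state bits

  def event(kind, addr):
      cur = state.get(addr, UNALLOC)
      nxt, err = TABLE.get((kind, cur), (cur, True))
      state[addr] = nxt
      if err:
          print(f"checker: {kind} of word {addr:#x} in state {cur}")

  event(ALLOCATE, 0x100)
  event(READ, 0x100)        # flagged: allocated but never written
  event(WRITE, 0x100)
  event(READ, 0x100)        # fine now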

HeapMon: A helper-thread approach to programmable, automatic, and low-overhead memory bug detection (IBM JRD 2006)

The ability to detect and pinpoint memory-related bugs in production runs is important because in-house testing may miss bugs. This paper presents HeapMon, a heap memory bug-detection scheme that has a very low performance overhead, is automatic, and is easy to deploy. HeapMon relies on two new techniques. First, it decouples application execution from bug monitoring, which executes as a helper thread on a separate core in a chip multiprocessor system. Second, it associates a filter bit with each cached word to safely and significantly reduce bug checking frequency—by 95% on average. We test the effectiveness of these techniques using existing and injected memory bugs in SPEC®2000 applications and show that HeapMon effectively detects and identifies most forms of heap memory bugs. Our results also indicate that the HeapMon performance overhead is only 5%, on average—orders of magnitude less than existing tools. Its overhead is also modest: 3.1% of the cache size and a 32-KB victim cache for on-chip filter bits and 6.2% of the allocated heap memory size for state bits, which are maintained by the helper thread as a software data structure.

Full paper: URL
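
The filter-bit mechanism can be mimicked in a few lines. The sketch below is a software analogy rather than the cache hardware: once a word passes a check, its filter bit is set and later accesses to it skip the helper entirely, which is where the large reduction in checking frequency comes from.

  # Sketch of the HeapMon filter-bit idea (a software analogy, not the cache
  # hardware): each word has a filter bit; accesses to filtered words skip the
  # helper-thread check entirely, and only unfiltered accesses are forwarded.
  checks_forwarded = 0

  def helper_check(addr, is_write, word_state):
      """Stand-in for the helper thread's bug check on one word."""
      global checks_forwarded
      checks_forwarded += 1
      return word_state.get(addr) == "allocated"   # e.g. flag accesses to freed words

  def access(addr, is_write, filter_bits, word_state):
      if filter_bits.get(addr):
          return True                              # filtered: no check needed
      ok = helper_check(addr, is_write, word_state)
      if ok:
          filter_bits[addr] = True                 # safe word: filter future accesses
      return ok

  word_state = {0x10: "allocated"}
  filter_bits = {}
  for _ in range(1000):
      access(0x10, is_write=False, filter_bits=filter_bits, word_state=word_state)
  print(f"forwarded {checks_forwarded} of 1000 checks to the helper")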

Cache Design for Improved Performance and Quality of Service

Counter-Based Cache Replacement Algorithms (ICCD 2005)

Recent studies have shown that in highly associative caches, the performance gap between the Least Recently Used (LRU) and the theoretical optimal replacement algorithms is large, suggesting that alternative replacement algorithms can improve the performance of the cache. One main reason for this performance gap is that in the LRU replacement algorithm, a line is only evicted after it becomes the LRU line, long after its last access/touch, while unnecessarily occupying the cache space for a long time. This paper proposes a new approach to deal with the problem: counter-based L2 cache replacement. In this approach, each line in the L2 cache is augmented with an event counter that is incremented when an event of interest, such as a cache access to the same set, occurs. When the counter exceeds a threshold, the line "expires", and becomes evictable. When expired lines are evicted early from the cache, they make extra space for lines that may be more useful, reducing the number of capacity and conflict misses. Each line's threshold is unique and is dynamically learned and stored in a small 40-Kbyte counter prediction table. We propose two new replacement algorithms: Access Interval Predictor (AIP) and Live-time Predictor (LvP). AIP and LvP speed up 10 (out of 21) SPEC2000 benchmarks by up to 40%, and by 11% on average.

Full paper: pdf
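
The expiration mechanism is easy to model for a single cache set. The sketch below simplifies AIP/LvP by using one fixed threshold instead of the learned per-line prediction table: each line counts accesses to its set since its last touch, and a line past the threshold is treated as expired and evicted ahead of the LRU victim.

  # Sketch of counter-based replacement (a simplification of AIP/LvP, with a
  # fixed threshold rather than the learned per-line prediction table): each
  # line counts accesses to its set since its last touch; a line past its
  # threshold has "expired" and is evicted ahead of the LRU victim.
  from collections import OrderedDict

  class CounterBasedSet:
      def __init__(self, ways=8, threshold=16):
          self.ways, self.threshold = ways, threshold
          self.lines = OrderedDict()              # tag -> accesses since last touch (LRU order)

      def access(self, tag):
          hit = tag in self.lines
          for t in self.lines:                    # event: an access to this set
              self.lines[t] += 1
          if hit:
              self.lines[tag] = 0
              self.lines.move_to_end(tag)         # most recently used
          else:
              if len(self.lines) >= self.ways:
                  expired = [t for t, c in self.lines.items() if c > self.threshold]
                  victim = expired[0] if expired else next(iter(self.lines))  # else LRU
                  del self.lines[victim]
              self.lines[tag] = 0
          return hit

  s = CounterBasedSet(ways=4, threshold=8)
  print([s.access(t) for t in (1, 2, 3, 1, 2, 3)])   # misses, then hits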

Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture (HPCA 2005)

This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a Chip Multi-Processor (CMP) architecture. Cache sharing impacts threads non-uniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the Inductive Probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.

Full paper: pdf
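
As a flavor of how an isolated profile turns into a contention prediction, the sketch below follows the spirit of the simplest of the three models (a frequency-proportional estimate, not the Inductive Probability model): each thread's effective share of the associativity is proportional to its access rate, and the extra misses are read off its own stack-distance histogram. The two profiles are hypothetical.

  # Sketch of predicting extra misses under cache sharing from isolated
  # stack-distance profiles, in the spirit of the simpler frequency-proportional
  # model: each thread's effective share of the cache is proportional to its
  # access rate, and extra misses are read off its own stack-distance histogram.
  def misses_at(profile, ways):
      """profile[d] = accesses with stack distance d (d >= ways means a miss)."""
      return sum(count for d, count in enumerate(profile) if d >= ways)

  def predict_extra_misses(profiles, total_ways):
      total_accesses = [sum(p) for p in profiles]
      grand_total = sum(total_accesses)
      extra = []
      for p, acc in zip(profiles, total_accesses):
          share = max(1, round(total_ways * acc / grand_total))   # effective ways
          extra.append(misses_at(p, share) - misses_at(p, total_ways))
      return extra

  # Two hypothetical threads sharing an 8-way set; profile index = stack distance.
  thread_a = [50, 40, 30, 20, 10, 5, 3, 2, 1, 1]      # reuse-friendly
  thread_b = [200, 10, 5, 5, 5, 5, 5, 5, 5, 5]        # access-heavy, streaming-ish
  print(predict_extra_misses([thread_a, thread_b], total_ways=8))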

Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture (PACT 2004)

This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the Operating System (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Second, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4, while increasing the throughput by 15%, compared to a non-partitioned shared cache.

Full paper: pdf
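
A minimal version of the dynamic partitioning loop is sketched below. It is a simplification of the paper's algorithms, using miss counts relative to running alone as the fairness proxy: each repartitioning interval, one way is moved from the thread least affected by sharing to the one affected most, until the ratios are close. The 0.05 tolerance is an illustrative parameter.

  # Sketch of dynamic fairness-oriented partitioning (a simplification of the
  # paper's algorithm): each thread's "slowdown" proxy is its miss count under
  # sharing divided by its miss count running alone; repartitioning moves one
  # way from the least-affected thread to the most-affected one.
  def repartition(ways, shared_misses, alone_misses, min_ways=1):
      ratios = [s / a for s, a in zip(shared_misses, alone_misses)]
      worst = max(range(len(ratios)), key=lambda i: ratios[i])
      best = min(range(len(ratios)), key=lambda i: ratios[i])
      if ratios[worst] - ratios[best] > 0.05 and ways[best] > min_ways:
          ways[best] -= 1
          ways[worst] += 1          # give a way to the thread hurt most by sharing
      return ways

  ways = [4, 4]                     # current per-thread way allocation
  print(repartition(ways, shared_misses=[900, 300], alone_misses=[300, 280]))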

Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses (HPCA 2004)

Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst-case conflict behavior, no study has analyzed the pathological behavior of such hashing functions, which often results in performance slowdown. In this paper, we present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions, prime modulo and prime displacement, that are resistant to pathological behavior and yet are able to eliminate the worst-case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow add operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory-intensive applications. For applications that have non-uniform cache accesses, both prime modulo and prime displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using multiple prime displacement hashing functions in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7%.

Full paper: pdf
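
The two hashing functions can be written down directly, under a simplifying assumption about which address bits form the tag: prime-modulo indexing maps a block address to (address mod p) for the largest prime p not exceeding the number of sets, leaving a handful of sets unused (the small fragmentation mentioned above), while prime displacement adds prime times tag to the conventional index before reducing modulo the number of sets.

  # Sketch of prime-modulo and prime-displacement indexing (a software model of
  # the index computation, not the narrow-adder hardware implementation).
  def largest_prime_at_most(n):
      def is_prime(k):
          return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))
      while not is_prime(n):
          n -= 1
      return n

  NUM_SETS = 1024
  P = largest_prime_at_most(NUM_SETS)          # 1021

  def prime_modulo_index(block_addr):
      return block_addr % P                    # sets P..NUM_SETS-1 stay unused

  def prime_displacement_index(block_addr, prime=17):
      tag = block_addr // NUM_SETS             # assumes index bits are the low bits
      return (tag * prime + block_addr) % NUM_SETS

  # Strided accesses that all collide under simple modulo indexing spread out here.
  stride = 2048
  print({prime_modulo_index(i * stride) for i in range(8)})
  print({prime_displacement_index(i * stride) for i in range(8)})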

Analytical Performance Modeling of Cache Performance

An Analytical Model for Cache Replacement Policy Performance (Sigmetrics 2006)

Due to the increasing gap between CPU and memory speed, cache performance plays an increasingly critical role in determining the overall performance of microprocessor systems. One of the important factors that affect cache performance is the cache replacement policy. Despite its importance, current analytical cache performance models ignore the impact of cache replacement policies on cache performance. To the best of our knowledge, this paper is the first to propose an analytical model that predicts the performance of cache replacement policies. The input to our model is a simple circular sequence profiling of each application, which requires very little storage overhead. The output of the model is the predicted miss rates of an application under different replacement policies. The model is based on probability theory and utilizes Markov processes to compute each cache access's miss probability. The model uses realistic assumptions and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb. The model's run time is less than 0.1 seconds, much lower than that of trace simulations. We validate the model by comparing the predicted miss rates of seventeen SPEC2000 and NAS benchmark applications against the miss rates obtained by detailed execution-driven simulations, across a range of different cache sizes, associativities, and four replacement policies, and show that the model is very accurate. The model's average prediction error is 1.41%, and there are only 14 out of 952 validation points in which the prediction errors are larger than 10%.

Full paper: pdf
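
As a much smaller stand-in for the model (which uses circular-sequence profiles and Markov processes), the sketch below shows the general shape of such a predictor: a compact per-set reuse histogram goes in, and per-policy miss rates come out, here with LRU handled exactly by stack distance and random replacement handled by a crude per-insertion survival probability.

  # A much-simplified stand-in for the paper's model (the actual model uses
  # circular-sequence profiles and Markov processes): given a per-set reuse
  # histogram, hist[d] = reuses with d distinct intervening lines, predict the
  # miss rate of an A-way set under LRU exactly, and under random replacement
  # with a crude per-insertion survival probability of (1 - 1/A).
  def predict_miss_rate(hist, cold_misses, assoc, policy="lru"):
      reuses = sum(hist)
      total = reuses + cold_misses
      misses = cold_misses
      for d, count in enumerate(hist):
          if policy == "lru":
              misses += count if d >= assoc else 0      # LRU: distance decides exactly
          else:  # "random": line must survive d insertions into the set
              misses += count * (1 - (1 - 1 / assoc) ** d)
      return misses / total

  hist = [30, 20, 15, 10, 8, 6, 4, 4, 2, 1]             # hypothetical profile
  for policy in ("lru", "random"):
      print(policy, round(predict_miss_rate(hist, cold_misses=10, assoc=4, policy=policy), 3))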

Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture (HPCA 2005)

This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a Chip Multi-Processor (CMP) architecture. Cache sharing impacts threads non-uniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the Inductive Probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.

Full paper: pdf

Helper Thread Prefetching

Helper Thread Prefetching for Loosely-Coupled Multiprocessor Systems (IPDPS 2006)

This paper presents a helper thread prefetching scheme that is designed to work on loosely-coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely-coupled processors have an advantage in that fine-grain resources, such as processor and L1 cache resources, are not contended by the application and helper threads, hence preserving the speed of the application. However, inter-processor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely-coupled system can be done effectively, we evaluate our prefetching in a standard, unmodified CMP system, and in an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme on nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33.

Full paper: pdf
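
The run-ahead control between the two threads can be sketched with ordinary threads. The code below is a software analogy of the synchronization idea, not the paper's mechanism: the application publishes its current loop iteration, and the helper prefetches only within a bounded window ahead of it, stalling when it gets too far ahead so prefetches stay timely without polluting the cache. The window size and timings are illustrative.

  # Sketch of the run-ahead control between application and helper threads
  # (a software analogy of the synchronization idea): the application publishes
  # its current iteration; the helper "prefetches" only within a bounded window
  # ahead of it, waiting whenever it gets too far ahead.
  import threading, time

  MAX_AHEAD = 32                     # how far the helper may run ahead
  app_iter = 0
  done = False

  def application(data, work=0.0005):
      global app_iter, done
      for i in range(len(data)):
          _ = data[i]                # real computation would go here
          time.sleep(work)
          app_iter = i
      done = True

  def helper(data):
      i = 0
      while not done and i < len(data):
          if i - app_iter > MAX_AHEAD:
              time.sleep(0.0001)     # too far ahead: wait for the application
              continue
          _ = data[i]                # stands in for issuing a prefetch
          i += 1

  data = list(range(2000))
  t1 = threading.Thread(target=application, args=(data,))
  t2 = threading.Thread(target=helper, args=(data,))
  t2.start(); t1.start(); t1.join(); t2.join()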