Projects

I have had the pleasure of working on many projects over the last couple of decades.  This page lists a few of those projects, and I'll try to be more diligent about creating pages for my projects in the future.

GPU Memory Debugging

2021-

CUDA, OpenCL, and OpenACC are the primary means of writing general-purpose software for NVIDIA GPUs, and all of them are subject to the same well-documented memory safety vulnerabilities that plague software written in C and C++. One can argue that the GPU execution environment makes software development even more error-prone. Unlike C and C++, CUDA features multiple, distinct memory spaces that map to the GPU’s unique memory hierarchy, and a typical CUDA program has thousands of concurrently executing threads. Furthermore, the CUDA platform has fewer guardrails than CPU platforms, which have been forced to adjust incrementally to a barrage of security attacks. Unfortunately, the peculiarities of the GPU make it difficult to directly port memory safety solutions from the CPU space.

We have been working on memory safety solutions that target the GPU and the CUDA platform.  Our first published work in this area is a software-only solution, called cuCatch.  See our PLDI publication to learn more.

Cooperative Profile Guided Optimization

2019-

Existing feedback-driven optimization frameworks are not suitable for video games, which tend to push the limits of gaming hardware and software performance and have hard real-time constraints that preclude all but the simplest execution profiling. While profile-guided optimizations (PGO) have low runtime overhead because the optimization work happens at compile time, they require multiple compilation passes and are poorly suited to interactive applications. We introduce Cooperative PGO, a methodology in which the gaming platform collects piecemeal profiles by sampling in both time and space during actual gameplay across many users; stitches the piecemeal profiles together statistically; and creates policies to guide future gameplay. We introduce a three-level hierarchical profiler, well suited to graphics APIs, that usually operates with no overhead and introduces an average overhead of less than 0.5% during the occasional periods of active profiling. We examine the practicality of Cooperative PGO using three PGOs as case studies. We apply a Cooperative PGO approach to our PGZ PGO (see below) and achieve an average speedup of 5%, with a maximum speedup of 15%, over a highly tuned baseline. A paper on this project, which you can find here, will appear at this year's conference on High Performance Graphics and in the associated journal Computer Graphics Forum. You can also watch the HPG talk here.
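The sample-and-stitch idea can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the profiler described in the paper: the shader names, the "true" zero rates, the `sample_session`, `stitch`, and `make_policy` helpers, and the simple pooling of counts are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical ground truth: how often a profiled operand is zero in each
# shader. In a real deployment this is exactly what is unknown.
TRUE_ZERO_RATE = {"shader_a": 0.95, "shader_b": 0.10, "shader_c": 0.50}

def sample_session(rng, space_fraction=0.34, time_samples=200):
    """Collect one user's piecemeal profile.

    Sampling in space: only a random subset of shaders is instrumented.
    Sampling in time: each chosen shader is observed for a limited
    number of executions.
    """
    profile = {}
    for shader, rate in TRUE_ZERO_RATE.items():
        if rng.random() < space_fraction:
            zeros = sum(rng.random() < rate for _ in range(time_samples))
            profile[shader] = (zeros, time_samples)
    return profile

def stitch(profiles):
    """Merge piecemeal profiles by pooling the counts for each shader."""
    totals = defaultdict(lambda: [0, 0])
    for profile in profiles:
        for shader, (zeros, n) in profile.items():
            totals[shader][0] += zeros
            totals[shader][1] += n
    return {shader: zeros / n for shader, (zeros, n) in totals.items()}

def make_policy(stitched, threshold=0.9):
    """Mark a shader for specialization when its estimated rate is high."""
    return {shader: rate >= threshold for shader, rate in stitched.items()}

rng = random.Random(0)  # fixed seed so the simulation is repeatable
profiles = [sample_session(rng) for _ in range(500)]
stitched = stitch(profiles)
policy = make_policy(stitched)
```

No single session sees every shader, yet the pooled estimate converges on the underlying rates because each user contributes a small, cheap slice of the overall profile.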

Profile-Guided Zero-Value Specialization for Direct3D

2019-

We have characterized register operand value profiles in shader programs of modern games and found that many operands are likely to dynamically contain the value zero. Furthermore, we found that many of these zero operands feed long arithmetic chains whereby zero propagates through the forward slice of computation, and can then render backward slices of computation unnecessary. We were able to gain an intuition for how these zero values originate and propagate through the computation, and we devised a manual transformation, which we call Zeroploit, to specialize computation for high-payoff, likely-zero operands. We demonstrate that Zeroploit is able to achieve an average speedup of 35.8% for targeted shader programs, amounting to an average frame-rate speedup of 2.8% across a collection of modern games on an NVIDIA® GeForce RTX™ 2080 GPU. See "Zeroploit: Exploiting Zero Valued Operands in Interactive Gaming Applications" for all the details.
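The flavor of this specialization can be shown on a single expression. The sketch below is written in Python rather than shader code, and `expensive_term`, `shade_original`, and `shade_specialized` are hypothetical names that do not come from the paper. It assumes all values are finite, since `0 * inf` and `0 * NaN` are not zero in IEEE floating point.

```python
import math

def expensive_term(x):
    """Stand-in for a long arithmetic chain in a shader (hypothetical)."""
    acc = 0.0
    for k in range(1, 50):
        acc += math.sin(k * x) / k
    return acc

def shade_original(base, weight, x):
    # The backward slice that computes expensive_term(x) always runs,
    # even when weight is zero and the product contributes nothing.
    return base + weight * expensive_term(x)

def shade_specialized(base, weight, x):
    # Specialized for the likely-zero operand: test it once and, on the
    # fast path, skip the computation that only feeds the product.
    if weight == 0.0:
        return base
    return base + weight * expensive_term(x)
```

The specialized version pays for one extra comparison, so it is only worthwhile when the profile says the operand is zero often enough and the skipped slice is expensive enough.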

We then automated the manual successes of Zeroploit by designing a series of transformations we call PGZ. We implemented PGZ in the NVIDIA GeForce Game Ready driver. On a collection of modern gaming applications, PGZ achieved an average speedup of 21% for targeted shader programs and an average frame-rate speedup of over 4%. See "PGZ: Automatic Zero-Value Code Specialization" for details and the accompanying presentation.

GPU Binary Instrumentation

2014-2018

We created some generic binary instrumentation tools for NVIDIA GPUs.  I developed a compiler-based instrumentation tool, called SASSI, and Oreste Villa created a binary instrumentation tool called NVBit.  

Adjustable Program-Level Anomaly Detection

2010

This was one of my favorite projects to work on! We created a statistical anomaly detection system that a compiler embeds within a deployed application to help thwart malicious attacks on vulnerable code. Our approach crowdsources low-level profiling of application behavior, which our system analyzes to automatically build anomaly detection models. With statistical modeling of abnormal events, there is a fundamental trade-off: reducing false positives (i.e., reducing the likelihood of flagging a legitimate execution) tends to increase false negatives (i.e., increase the likelihood of a malicious attack not getting flagged). Our system gave the end user a simple mechanism to adjust this balance according to their tolerance for risk.
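The trade-off can be made concrete with a toy detector. This is a sketch under stated assumptions, not the system from the paper: the mean and standard deviation model, the `sensitivity` knob, and all of the sample values are hypothetical.

```python
import statistics

def build_model(observations):
    """Fit a simple model (mean, standard deviation) to pooled profiles."""
    return statistics.mean(observations), statistics.pstdev(observations)

def is_anomalous(value, model, sensitivity):
    """Flag values more than `sensitivity` standard deviations from the mean."""
    mean, std = model
    return abs(value - mean) > sensitivity * std

# Hypothetical profiled quantity, e.g. the length of an input buffer.
training = [10, 12, 11, 9, 10, 13, 11, 10, 12, 11]  # crowdsourced profiles
legitimate = [10, 11, 14, 9, 15, 12]                # later benign runs
attacks = [16, 40, 200]                             # later malicious runs

model = build_model(training)

def error_counts(sensitivity):
    """Return (false positives, false negatives) at a given setting."""
    false_pos = sum(is_anomalous(v, model, sensitivity) for v in legitimate)
    false_neg = sum(not is_anomalous(v, model, sensitivity) for v in attacks)
    return false_pos, false_neg

strict = error_counts(2.0)   # flags aggressively
lenient = error_counts(5.0)  # flags conservatively
```

On this made-up data the strict setting flags two benign runs but catches every attack, while the lenient setting flags no benign runs but misses one attack. A single user-facing knob moves between those two regimes.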

See this paper from CGO 2010 and this associated slide deck for more details.

Predication Techniques for Out-of-Order Processors

2009

Out-of-Order (OOO) processing is a well-known and effective technique for improving the performance of a single thread of execution.  Predication is another well-known and effective technique for optimizing unpredictable or short regions of control flow.  However, these two proven techniques are at odds with one another.  While many in-order processors offer full predication support, OOO processors offer only "conditional move" style predication.  We introduce a generalized form of "hammock" predication that requires few modifications to an existing processor pipeline, yet presents a compiler with abundant predication opportunities. 
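For readers unfamiliar with the terminology, a "hammock" is a small if-then-else region whose two paths rejoin immediately. The sketch below shows the baseline "conditional move" style of if-conversion on such a region. It is only an illustration on integers; it does not model the generalized hardware mechanism the paper proposes, and the `select` helper and both function names are hypothetical.

```python
def select(pred, if_true, if_false):
    """Branch-free model of a conditional move on integers."""
    mask = -int(bool(pred))  # -1 (all ones) when pred holds, else 0
    return (if_true & mask) | (if_false & ~mask)

def hammock_branchy(cond, a, b):
    # Control flow splits on cond and rejoins right after the assignment.
    if cond:
        x = a + 1
    else:
        x = b * 2
    return x

def hammock_predicated(cond, a, b):
    # If-converted form: compute both sides, then pick one without a branch.
    then_value = a + 1
    else_value = b * 2
    return select(cond, then_value, else_value)
```

The predicated form trades a potentially mispredicted branch for the cost of executing both sides, which is why it pays off mainly for short or unpredictable regions.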

For more information, see our HPCA paper.

Meta Optimization

2001-2005

We proposed Meta Optimization, which was an early attempt to automatically create compiler heuristics using machine learning. 
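As a flavor of what a learned heuristic looks like, here is a minimal sketch that picks a loop unroll factor with a one-nearest-neighbor lookup. The features, their scaling to the range 0 to 1, the training examples, and the function names are all made up for illustration; they are not data or code from the papers listed below.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_unroll_factor(training, features):
    """Return the label of the closest training example (1-nearest-neighbor)."""
    nearest = min(training, key=lambda example: squared_distance(example[0], features))
    return nearest[1]

# Hypothetical training set: (scaled trip count, scaled body size) -> the
# unroll factor that performed best when the loop was actually measured.
TRAINING = [
    ((0.9, 0.1), 8),  # long-running loop, tiny body
    ((0.9, 0.9), 2),  # long-running loop, large body
    ((0.1, 0.1), 1),  # short loop, tiny body
    ((0.5, 0.3), 4),  # medium loop, small body
]

factor_hot_small = predict_unroll_factor(TRAINING, (0.8, 0.2))
factor_short = predict_unroll_factor(TRAINING, (0.2, 0.1))
factor_hot_large = predict_unroll_factor(TRAINING, (0.8, 0.8))
```

The point of the approach is that the table of examples comes from measurement, so retuning the heuristic for a new target means collecting new examples rather than hand-editing compiler code.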

Please consult the following resources to learn more:

Mark Stephenson, Saman Amarasinghe. Predicting Unroll Factors Using Supervised Classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). San Jose, California. March 2005 (ppt, project page).

Diego Puppin, Mark Stephenson, Saman Amarasinghe, Una-May O'Reilly, Martin Martin. Adapting Convergent Scheduling Using Machine Learning. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing, College Station, TX, October 2003 (pdf, ppt).

Mark Stephenson, Martin Martin, Una-May O'Reilly, and Saman Amarasinghe. Meta Optimization: Improving Compiler Heuristics with Machine Learning. In Proceedings of the SIGPLAN '03 Conference on Programming Language Design and Implementation (PLDI), San Diego, CA, June 2003 (pdf, ppt, project page).

Mark Stephenson, Una-May O'Reilly, Martin Martin, and Saman Amarasinghe. Genetic Programming Applied to Compiler Heuristic Optimization. In Proceedings of the 6th European Conference on Genetic Programming (EuroGP), Essex, UK, April 14, 2003 (pdf, ppt, project page).

Automating the Construction of Compiler Heuristics Using Machine Learning. PhD thesis. Massachusetts Institute of Technology. May 2006 (pdf, project page, job talk).

Microsoft recorded my job talk on this topic, which you can watch here.

Bitwidth Analysis

1999-2000

We created a compiler analysis that infers the bitwidth and data ranges of program variables.  The scope of the analysis includes fixed point arithmetic, bit manipulation and Boolean operations.  It uses additional sources of information such as type casts, array bounds, and loop iteration counts to refine bitwidth and range information.
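The core idea of range propagation can be sketched in a few lines. This is a simplified illustration, not the analysis from the paper: the `Range` class, the two transfer functions, and the example program fragment in the comments are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    """Inclusive integer range [lo, hi] that a variable may take."""
    lo: int
    hi: int

def add(a, b):
    """Range of a + b."""
    return Range(a.lo + b.lo, a.hi + b.hi)

def and_mask(a, mask):
    """Range of a & mask for a non-negative constant mask."""
    hi = min(a.hi, mask) if a.lo >= 0 else mask
    return Range(0, hi)

def bits_needed(r):
    """Smallest width that can hold every value in the range."""
    if r.lo >= 0:
        return max(1, r.hi.bit_length())  # unsigned
    # Two's complement: one sign bit plus enough magnitude bits.
    return max(max(r.hi, 0).bit_length(), (~r.lo).bit_length()) + 1

# for (int i = 0; i < 100; i++)   -> the loop bound limits i to [0, 99]
i = Range(0, 99)
# int x = input & 0x0F;           -> the mask limits x to [0, 15]
x = and_mask(Range(-2**31, 2**31 - 1), 0x0F)
# int y = i + x;                  -> ranges add to [0, 114]
y = add(i, x)

widths = {"i": bits_needed(i), "x": bits_needed(x), "y": bits_needed(y)}
```

All three variables would be declared as 32-bit integers in C, yet under these assumptions `i` and `y` fit in 7 bits and `x` in 4, which is the kind of saving that matters when every bit becomes hardware.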

We evaluated the analysis in the context of a C-to-silicon compiler.

Mark Stephenson, Jonathan Babb, and Saman Amarasinghe. Bitwidth Analysis with Application to Silicon Compilation. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Vancouver, British Columbia, June 2000 (pdf, ppt, project page).