My main area of research is Parallel Computer Architecture and its interface with Compilers and Parallel Programming.
My current research interests include the following:
1. Improving CUDA SASS (Streaming Assembler) through device-specific optimizations. SASS assembles PTX virtual instructions into binary microcode that executes natively on GPU hardware. SASS largely does three things: 1) translates PTX virtual instructions to native instructions, 2) allocates virtual registers to the device's physical registers, and 3) schedules native instructions on the GPU to minimize latency. Until recently, SASS was a black box to researchers because NVIDIA publishes only the PTX virtual ISA and keeps the native ISA of each GPU architecture secret. That has changed thanks to the painstaking efforts of researchers around the world to discover the native ISA of each GPU architecture through trial and error. Building on these efforts, we have found that the existing assembler does not do an optimal job in many situations.
2. Improving the quantum computer (QC) circuit assembler. Currently, the "source code" for quantum computers consists of a quantum circuit formed using quantum gates such as Hadamard gates and controlled NOT gates. The job of the assembler is to translate this programmed circuit to the physical circuit of the quantum computer. The first prerequisite for this to be possible is that the QC physical circuit has at least as many quantum bits (qubits) as the programmed circuit. But even if the QC has enough qubits, this translation is not trivial since the programmed circuit may not fit into the physical circuit as is. The physical circuit layout of current QCs is irregular, and not all gates are available between all qubits in the circuit. A lot of massaging needs to happen to the original programmed circuit for this to work, including shifting qubits around and inserting additional quantum gates. Inserting additional quantum gates has two downsides: 1) quantum gates are noisy, so the more gates you use, the more inaccurate the final result is going to be, and 2) qubits decay over time, so if they have to go through more gates, the final result will again be more inaccurate. It has been shown that current assemblers are far from optimal, and this is what we seek to improve.
3. Improving the performance of scripting languages through hardware/software co-optimization. Scripting languages are gaining widespread use as programming comes to the masses. However, scripting languages are notoriously hard to compile efficiently due to their unstructured dynamism. On the other hand, trends in hardware point to a future where processors are increasingly energy-conscious, multicore, and heterogeneous. This adds another layer of complexity to efficient compilation and execution, as execution has to be increasingly parallelized to take advantage of the available resources. My goal is to find a cross-disciplinary solution to this problem that involves both hardware and compiler, initially targeting the JavaScript language.
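To make the qubit-mapping problem in item 2 concrete, here is a minimal Python sketch of a naive router. It is an illustration only, not the method of any existing assembler or of this research: the function names `shortest_path` and `route`, the tuple encoding of gates, and the dictionary encoding of the device's coupling graph are all assumptions made for this example. Whenever a controlled NOT is requested between two qubits that are not physically connected, the sketch inserts SWAP gates along a shortest path until they are adjacent. Every inserted SWAP is exactly the kind of extra noisy gate that a better assembler would try to avoid.

```python
from collections import deque


def shortest_path(coupling, src, dst):
    """Breadth-first search over the coupling graph; returns [src, ..., dst]."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(coupling[node]):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    if dst not in prev:
        raise ValueError("physical qubits are not connected")
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def route(circuit, coupling, layout):
    """Map a logical circuit onto hardware, inserting SWAPs greedily.

    circuit:  list of ("h", q) or ("cx", control, target) on logical qubits
    coupling: dict mapping each physical qubit to the set of its neighbours
    layout:   dict mapping logical qubit -> physical qubit (hypothetical encoding)
    """
    layout = dict(layout)  # work on a copy; the caller's layout is untouched
    physical = []
    for gate in circuit:
        if gate[0] != "cx":
            name, q = gate
            physical.append((name, layout[q]))
            continue
        _, ctrl, tgt = gate
        path = shortest_path(coupling, layout[ctrl], layout[tgt])
        # Move the control one hop at a time until it neighbours the target.
        for a, b in zip(path[:-2], path[1:-1]):
            physical.append(("swap", a, b))
            swapped = {a: b, b: a}
            layout = {lq: swapped.get(pq, pq) for lq, pq in layout.items()}
        physical.append(("cx", layout[ctrl], layout[tgt]))
    return physical, layout


# Four physical qubits connected in a line: 0 - 1 - 2 - 3
line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
ops, final_layout = route([("h", 0), ("cx", 0, 3)], line, {q: q for q in range(4)})
for op in ops:
    print(op)
```

On this line-shaped device, the single logical controlled NOT between qubits 0 and 3 costs two extra SWAPs. A smarter initial layout (for example, placing logical qubit 0 on physical qubit 2) would need none, which hints at why the choice of layout and routing strategy matters so much.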
My past research has also focused on hardware/compiler cross-disciplinary solutions. I have worked on leveraging hardware Transactional Memory (TM) support to perform compiler optimizations such as alias speculation and memory ordering speculation. I have also worked on leveraging hardware Bloom filters to perform compiler optimizations such as function memoization, and on TM and Thread Level Speculation (TLS) support to make parallel programming easier and sometimes to auto-parallelize programs. I contributed to the release of the TM and TLS system in the IBM Blue Gene/Q supercomputer, which is one of the first machines to support this in hardware.
Chi Zhang, Wonsun Ahn, Youtao Zhang, and Bruce R. Childers. Live Code Update for IoT Devices in Energy Harvesting Environments. In Non-Volatile Memory Systems and Applications Symposium, August 2016. [Presentation slides]
Wonsun Ahn, Jiho Choi, Thomas Shull, Maria Garzaran, Josep Torrellas. Improving JavaScript Performance by Deconstructing the Type System. In International Conference on Programming Language Design and Implementation (PLDI) -- Distinguished Paper Award, June 2014. [Presentation slides]
Wonsun Ahn, Yuelu Duan, Josep Torrellas. DeAliaser: Alias Speculation Using Atomic Region Support. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2013. [Presentation slides]
Wonsun Ahn, Shanxiang Qi, Jae-Woo Lee, Marios Nicolaides, Xing Fang, Josep Torrellas, David Wong and Samuel Midkiff. BulkCompiler: High-Performance Sequential Consistency through Cooperative Compiler and Hardware Support. In International Symposium on Microarchitecture (MICRO), December 2009. [Presentation slides]
James Tuck, Wonsun Ahn, Luis Ceze and Josep Torrellas. SoftSig: Software-Exposed Hardware Signatures for Code Analysis and Optimization. In Micro's Top Picks from Computer Architecture Conferences (TOPPICKS), January 2009.
James Tuck, Wonsun Ahn, Luis Ceze and Josep Torrellas. SoftSig: Software-Exposed Hardware Signatures for Code Analysis and Optimization. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008. [Presentation slides]
Shanxiang Qi, Abdullah Muzahid, Wonsun Ahn, Josep Torrellas. Dynamically Detecting and Tolerating IF-Condition Data Races. In International Symposium on High Performance Computer Architecture (HPCA), February 2014. [Presentation slides]
Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas. POSH: A TLS Compiler that Exploits Program Structure. In Symposium on Principles and Practice of Parallel Programming (PPoPP), March 2006. [Presentation slides]
Yuelu Duan, Xing Zhou, Wonsun Ahn, Josep Torrellas. BulkCompactor: Optimized Deterministic Execution via Conflict-Aware Commit of Atomic Blocks. In International Symposium on High Performance Computer Architecture (HPCA), February 2012. [Presentation slides]
Wucherl Yoo, Kevin Larson, Sangkyum Kim, Wonsun Ahn, Roy H. Campbell, and Lee Baugh. Automated Fingerprinting of Performance Pathologies Using Performance Monitoring Units (PMUs). In USENIX Workshop on Hot Topics in Parallelism (HotPar), May 2011.
Xuehai Qian, Wonsun Ahn and Josep Torrellas. ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment. In International Symposium on Microarchitecture (MICRO), December 2010. [Presentation slides]
Josep Torrellas, Luis Ceze, James Tuck, Calin Cascaval, Pablo Montesinos, Wonsun Ahn and Milos Prvulovic. The Bulk Multicore for Improved Programmability. In Communications of the ACM (CACM), December 2009. [Presentation slides]
Pablo Montesinos, Matthew D. Hicks, Wonsun Ahn, Samuel T. King and Josep Torrellas. Lessons Learned During the Development of the CapoOne Deterministic Multiprocessor Replay System. In Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), June 2009. [Presentation slides]
U.S. Patent #20110219381, Multiprocessor System with Multiple Concurrent Modes of Execution, International Business Machines, September 2011.
U.S. Patent #20110219191, Reader Set Encoding for Directory of Shared Cache Memory in Multiprocessor System, International Business Machines, September 2011.
Plug: if you are organizing virtual conferences during COVID-19, please consider the Whova virtual conference platform (resources). Luke Duan, my good friend and former colleague, is a founding member of Whova and has been working there ever since.