Shuai Che

I now work with the infrastructure team at Microsoft AI and have also been a member of Project DeepSpeed and Project Brainwave. I was previously employed by AMD Research and was involved in the U.S. Department of Energy’s Fastforward and Pathforward Exascale computing projects. I also worked in Alibaba's machine learning group on system research and development. I was the lead developer of the Rodinia benchmark suite for heterogeneous computing. Rodinia has been included in SPEC ACCEL V1.0 and SPECwpc V1.0 as standard accelerator benchmarks. I graduated from the University of Virginia in August 2012 with a Ph.D. in Computer Engineering.

Something I did for fun in my spare time

Research Interests

AI and machine learning systems, GPGPU, and graph processing

Selected Papers and Reports (Google Scholar/DBLP)

(To be updated)

● S. Che and J. Yin. Northup: Divide-and-Conquer Programming for Systems with Heterogeneous Memories and Processors. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2019.

● H. Yin, G. Chen, Y. Li, S. Che, W. Zhang and N. K. Jha. Hardware-Guided Symbolic Training for Compact, Accurate, yet Execution-Efficient LSTMs. https://arxiv.org/abs/1901.10997 .

● Y. Yu, Y. Li, S. Che, W. Zhang and N. K. Jha. Software-defined Design Space Exploration for Efficient AI Accelerator Architecture. https://arxiv.org/abs/1903.07676 .

● J. Yin, Y. Eckert, S. Che, M. Oskin, and G. Loh. Toward More Efficient NoC Arbitration: A Deep Reinforcement Learning Approach. In the International Workshop on AI-assisted Design for Architecture, in conjunction with ISCA, June 2018.

● S. Che, B. M. Beckmann, and S. K. Reinhardt. Programming GPGPU Graph Applications with Linear Algebra Building Blocks. To appear: International Journal of Parallel Programming (IJPP), 2017.

● M. Orr, S. Che, B. Beckmann, M. Oskin, S. K. Reinhardt, and D. Wood. Gravel: Efficient Fine-grain GPU-initiated Network Messaging. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2017.

● S. Che, M. Orr, and J. Gallmeier. Work Stealing in a Shared Virtual Memory Heterogeneous Environment. In Proceedings of the ACM International Conference on Computing Frontiers (CF), May 2017.

● K. Hou, W. Feng, and S. Che. Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), May 2017.

● N. Malaya, S. Che, J. Greathouse, R. Oostrum, and M. Schulte. Accelerating Matrix Processing with GPUs. In Proceedings of the IEEE Symposium on Computer Arithmetic (ARITH), invited paper, July 2017.

● S. Che, A. Basu, and J. Gallmeier. Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors. In Proceedings of the International Symposium on Memory Systems, Oct 2016.

● A. Basu, S. Puthoor, S. Che, and B. Beckmann. Software Assisted Hardware Cache Coherence for Heterogeneous Architectures. In Proceedings of the International Symposium on Memory Systems, Oct 2016.

● S. Che, M. Orr, G. Rodgers, and J. Gallmeier. Betweenness Centrality in an HSA-enabled System. In the 1st High Performance Graph Processing workshop (HPGP), May 2016.

● S. Puthoor, A. Aji, S. Che, M. Daga, W. Wu, B. M. Beckmann, and G. Rodgers. Implementing Directed Acyclic Graphs with the Heterogeneous System Architecture. In the 9th Workshop on General Purpose Processing on Graphics Processing Units, Mar 2016.

● S. Che, G. Rodgers, B. M. Beckmann, and S. K. Reinhardt. Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance. In Proceedings of IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), May 2015.

● M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood. Synchronization Using Remote-Scope Promotion. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar 2015. (pdf)

● G. Juckeland, W. Brantley, S. Chandrasekaran, B. Chapman, S. Che, M. Colgrove, H. Feng, A. Grund, R. Henschel, W-M. Hwu, H. Li, M. S. Muller, M. Perminov, P. Shelepugin, K. Skadron, J. Stratton, A. Titov, K. Wang, M. Waveren, B. Whitney, S. Wienke, R. Xu, and K. Kumaran. SPEC ACCEL - A Standard Application Suite for Measuring Hardware Accelerator Performance. In Proceedings of 5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Nov 2014. (pdf)

● S. Che, B. M. Beckmann, and S. K. Reinhardt. BelRed: Constructing GPGPU Graph Applications with Software Building Blocks. In Proceedings of IEEE High Performance Extreme Computing Conference (HPEC), Sept 2014. (pdf)

● S. Che. GasCL: A Vertex-Centric Graph Model for GPUs. In Proceedings of the IEEE High Performance Extreme Computing Conference, Sept 2014. (pdf)

● S. Che, J. Meng and K. Skadron. Dymaxion++: a Directive-Based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems. The 4th International Workshop on Accelerators and Hybrid Exascale Systems, May 2014. (pdf)

● B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs. In Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2014. (pdf)

● S. Che, B. M. Beckmann, S. K. Reinhardt and K. Skadron. Accelerating and Evaluating OpenCL Graph Applications. AMD Developer Summit (APU), Nov 2013. (pdf)

● S. Che and K. Skadron. BenchFriend: Correlating the Performance of GPU Benchmarks. International Journal of High-Performance Computing Applications (IJHPCA), Oct 2013. (pdf)

● S. Che, B. M. Beckmann, S. K. Reinhardt and K. Skadron. Pannotia: Understanding Irregular GPGPU Graph Applications. In Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013. (pdf)

● M. Boyer, K. Skadron, S. Che, and N. Jayasena. Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability. In Proceedings of the 10th Conference on Computing Frontiers (CF), May 2013. (pdf)

● W. Heirman, T. E. Carlson, S. Che, K. Skadron, and L. Eeckhout. Using Cycle Stacks to Understand Scaling Bottlenecks in Multi-Threaded Workloads. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), Nov. 2011. (pdf)

● S. Che, J. W. Sheaffer, and K. Skadron. Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing), Nov. 2011. (pdf)

● S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), Dec. 2010. (pdf)

● S. Che, M. Boyer, J. Meng, D. Tarjan, S. Lee, J. W. Sheaffer, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of IEEE International Symposium on Workload Characterization (IISWC), Oct 2009. (pdf)

● S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A Performance Study of General Purpose Applications on Graphics Processors using CUDA. Journal of Parallel and Distributed Computing (JPDC), 68(10):1370-1380, Jun 2008. (pdf)

● S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating Compute Intensive Applications with GPUs and FPGAs. In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), Jun 2008. (pdf)

● S. Che, J. Meng, J. W. Sheaffer, and K. Skadron. A Performance Study of General Purpose Applications on Graphics Processors. First Workshop on General Purpose Processing on Graphics Processing Units, Oct 2007. (pdf)

● J. Meng, S. Che, J. W. Sheaffer, J. Li, J. Huang and K. Skadron. Hierarchical Domain Partitioning For Hierarchical Architectures. Tech. Report CS-2008-08, Univ. of Virginia Dept. of Computer Science, Jun 2008. (pdf)

● J. Meng, S. R. Tarapore, S. Che, J. Huang, J. W. Sheaffer, and K. Skadron. Programming with Relaxed Streams. Tech. Report CS-2007-17, Univ. of Virginia Dept. of Computer Science, Dec 2007. (pdf)
