Paper Readings

These are the papers you need to read for each course topic. You will also need to submit a weekly paper review for a subset of these papers, listed here.

Intro:

[1] A. J. Smith, "The Task of the Referee," IEEE Computer, 1990.

[2] M. D. Hill, S. Adve, L. Ceze, M. J. Irwin, D. Kaeli, M. Martonosi, J. Torrellas, T. F. Wenisch, D. Wood and K. Yelick, "21st Century Computer Architecture," CCC Whitepaper, 2012.

Multicores and Multiprogramming:

[3] E. Fatehi and P. V. Gratz, "ILP and TLP in Shared Memory Applications: A Limit Study," PACT, 2014.

[4] C. Bienia, S. Kumar, J. P. Singh and K. Li, "The PARSEC Benchmark Suite: Characterization and Architectural Implications," PACT, 2008.

[5] K. Kennedy and K. S. McKinley, "Optimizing for Parallelism and Data Locality," ICS, 1992.

Synchronization:

[6] M. L. Scott, "Shared-Memory Synchronization," Synthesis Lectures on Computer Architecture, Chapters 1, 4.0-4.3.3 and 5.0-5.2.5.

[7] R. Rajwar and J. R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," MICRO, 2001.

Cache and Memory Hierarchy:

[8] D. J. Sorin, M. D. Hill and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence," Synthesis Lectures on Computer Architecture, Chapter 2.

[9] J. Kim, M. Sullivan, E. Choukse and M. Erez, "Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures," ISCA, 2016.

[10] J. San Miguel, J. Albericio, N. Enright Jerger and A. Jaleel, "The Bunker Cache for Spatio-Value Approximation," MICRO, 2016.

[11] M. A. Ogleari, E. L. Miller and J. Zhao, "Steal but No Force: Efficient Hardware Undo+Redo Logging for Persistent Memory Systems," HPCA, 2018.

Coherence:

[8] D. J. Sorin, M. D. Hill and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence," Synthesis Lectures on Computer Architecture, Chapters 6-8.

[12] P. V. Rengasamy, A. Sivasubramaniam, M. T. Kandemir and C. R. Das, "Exploiting Staleness for Approximating Loads on CMPs," PACT, 2015.

[13] G. Zhang, W. Horn and D. Sanchez, "Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems," MICRO, 2015.

Transactional Memory:

[14] T. Harris, J. Larus and R. Rajwar, "Transactional Memory, 2nd Edition," Synthesis Lectures on Computer Architecture, Chapters 1 and 5.

[15] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill and D. A. Wood, "LogTM: Log-Based Transactional Memory," HPCA, 2006.

Consistency:

[8] D. J. Sorin, M. D. Hill and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence," Synthesis Lectures on Computer Architecture, Chapters 3-5.

[16] S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," IEEE Computer, 1996.

[17] M. D. Hill, "Multiprocessors Should Support Simple Memory Consistency Models," IEEE Computer, 1998.

[18] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter and C.-T. Chou, "DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism," PACT, 2011.

Interconnects:

[19] N. Enright Jerger, T. Krishna and L.-S. Peh, "On-Chip Networks, Second Edition," Synthesis Lectures on Computer Architecture, Chapters 3-6.

[20] T. Moscibroda and O. Mutlu, "A Case for Bufferless Routing in On-Chip Networks," ISCA, 2009.

[21] J. Kim, J. Balfour and W. Dally, "Flattened Butterfly Topology for On-Chip Networks," MICRO, 2007.

[22] Z. Li, J. San Miguel and N. Enright Jerger, "The Runahead Network-On-Chip," HPCA, 2016.

[23] M. Parasar, H. Farrokhbakht, N. Enright Jerger, P. Gratz, T. Krishna and J. San Miguel, "DRAIN: Deadlock Removal for Arbitrary Irregular Networks," HPCA, 2020.

GPUs:

[24] H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi and W.-M. Hwu, "Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)," Synthesis Lectures on Computer Architecture, Chapter 1.

[25] D. Wong, N. S. Kim and M. Annavaram, "Approximating Warps with Intra-Warp Operand Value Similarity," HPCA, 2016.

Accelerators:

[26] M. S. B. Altaf and D. A. Wood, "LogCA: A High-Level Performance Model for Hardware Accelerators," ISCA, 2017.

[27] T. Nowatzki, V. Gangadhar, K. Sankaralingam and G. Wright, "Pushing the Limits of Accelerator Efficiency while Retaining Programmability," HPCA, 2016.

Unconventional Parallelism:

[28] M. C. Jeffrey, S. Subramanian, C. Yan, J. Emer and D. Sanchez, "A Scalable Architecture for Ordered Parallelism," MICRO, 2015.

[29] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw and R. Das, "Compute Caches," HPCA, 2017.

[30] J. San Miguel and N. Enright Jerger, "The Anytime Automaton," ISCA, 2016.

[31] A. Madhavan, T. Sherwood and D. Strukov, "Race Logic: A Hardware Acceleration for Dynamic Programming Algorithms," ISCA, 2014.

[32] D. Wu, J. Li, R. Yin, H. Hsiao, Y. Kim and J. San Miguel, "uGEMM: Unary Computing Architecture for GEMM Applications," ISCA, 2020.