The 1st International Workshop on Architecture for Graph Processing
Keshav Pingali, The University of Texas at Austin
Abstract: We describe Glue, a communication substrate for bulk-synchronous distributed graph analytics that can support heterogeneity in programming models, partitioning policies, and processor types (CPU/GPU). Glue can be interfaced with existing shared-memory graph analytics systems with minimal effort, enabling these systems to operate on clusters. Glue incorporates communication optimizations explicitly targeted for the irregular communication patterns of graph analytics applications.
To demonstrate Glue's ability to support heterogeneous programming models, we have interfaced Glue with Galois and Ligra, two shared-memory graph analytics systems, to produce distributed-memory versions of these systems. We show that these systems deliver significantly better performance and scaling than existing distributed-memory graph analytics systems. To demonstrate Glue's ability to support heterogeneous processors, we interfaced Glue with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce the first multi-GPU distributed-memory graph analytics system. Experiments on a heterogeneous cluster show that CPU-only and GPU-only Glue deliver geometric-mean speedups of 5.8x and 9.6x, respectively, over the state-of-the-art Gemini system. This is joint work with Marc Snir's group at UIUC.
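To make the bulk-synchronous model in the abstract concrete, here is a minimal, self-contained sketch of alternating compute and communication phases over a partitioned graph. All names (`Partition`, `sync_boundaries`) are illustrative inventions, not Glue's actual interface; the substrate's job corresponds to the `sync_boundaries` step, which reconciles node values replicated across partitions.

```python
# Sketch of bulk-synchronous graph analytics over two "hosts" (illustrative
# only; this is NOT Glue's API). Each round: local compute on every
# partition, then a synchronization of boundary (replicated) nodes.

class Partition:
    def __init__(self, edges, owned, boundary):
        self.edges = edges          # local adjacency: node -> neighbor list
        self.owned = owned          # nodes this host owns
        self.boundary = boundary    # nodes replicated on other hosts
        self.label = {n: n for n in owned | boundary}  # component labels

    def compute(self):
        """One local label-propagation step: pull the minimum label across
        every local edge (computes connected components when iterated)."""
        for u, nbrs in self.edges.items():
            for v in nbrs:
                m = min(self.label[u], self.label[v])
                self.label[u] = self.label[v] = m

def sync_boundaries(parts):
    """Communication phase: reduce each replicated node's label with min
    across partitions, then broadcast the result back to every replica."""
    merged = {}
    for p in parts:
        for n in p.boundary:
            merged[n] = min(merged.get(n, n), p.label[n])
    for p in parts:
        for n in p.boundary:
            p.label[n] = merged[n]

# Path graph 0-1-2-3 split across two hosts; node 2 is replicated.
p0 = Partition({0: [1], 1: [0, 2], 2: [1]}, owned={0, 1}, boundary={2})
p1 = Partition({2: [3], 3: [2]}, owned={3}, boundary={2})
parts = [p0, p1]
for _ in range(4):              # a few bulk-synchronous rounds suffice here
    for p in parts:
        p.compute()
    sync_boundaries(parts)

print(p1.label[3])  # all nodes converge to component label 0
```

Because only the labels of boundary nodes cross partitions, the communication volume per round is proportional to the size of the partition frontier rather than the whole graph, which is the kind of irregular, sparse exchange Glue's optimizations target.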
Biography: Keshav Pingali is a Professor in the Department of Computer Science at the University of Texas at Austin, and he holds the W.A. "Tex" Moncrief Chair of Computing in the Institute for Computational Engineering and Sciences (ICES) at UT Austin. He was on the faculty of the Department of Computer Science at Cornell University from 1986 to 2006, where he held the India Chair of Computer Science.
Pingali is a Fellow of the ACM, IEEE, and AAAS, and a Distinguished Alumnus of IIT Kanpur, India. He was the co-Editor-in-Chief of the ACM Transactions on Programming Languages and Systems (TOPLAS) and currently serves on the editorial boards of the ACM Transactions on Parallel Computing, the International Journal of Parallel Programming, and Distributed Computing. He has also served on the NSF CISE Advisory Committee (2009-2012).
Christopher De Sa, Stanford University
Abstract: As machine learning problems become larger and more widely used, there is an increasing need for algorithms and systems that can solve large-scale problems quickly. In this setting, the most popular algorithm is stochastic gradient descent (SGD). SGD and related methods are efficient because they update their model state randomly based on only a small amount of data, which allows them to start making progress without having to process the entire input dataset. Because SGD is random and accesses data in a non-uniform way, there are many opportunities to increase its throughput by taking advantage of its particular statistical and systems properties. In this talk, I will discuss SGD and ways to improve its performance on CPUs using both software changes and architectural improvements, with an application focus on large-scale graph problems such as recommender systems.
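The core property the abstract relies on, that each SGD update touches only a small random sample of the data, can be shown in a few lines. This is a generic textbook sketch (one-parameter least squares), not code from the talk; the learning rate and dataset are arbitrary choices for illustration.

```python
# Minimal SGD sketch: fit y = w*x by sampling one data point per step,
# so each update reads a tiny slice of the dataset rather than all of it.
import random

random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 11)]  # ground truth: w = 3

w, lr = 0.0, 0.001
for step in range(2000):
    x, y = random.choice(data)       # random, non-uniform data access
    grad = 2 * (w * x - y) * x       # gradient of (w*x - y)**2 w.r.t. w
    w -= lr * grad                   # stochastic update

print(round(w, 2))  # converges near the true weight 3.0
```

Because each step needs only one `(x, y)` pair, the working set per update is tiny; this is precisely the access pattern whose statistical and systems properties (asynchrony, relaxed precision, cache behavior) the talk exploits.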
Daniel Sanchez, MIT CSAIL
Abstract: With Moore's Law coming to an end, architects must find ways to sustain performance growth without technology scaling. The most promising path is to build highly parallel systems that harness thousands of simple and efficient cores. But this approach will require new techniques to make massive parallelism practical, as current multicores fall short of this goal: they squander most of the parallelism available in applications and are hard to program.
I will present Swarm, a new architecture that successfully parallelizes algorithms that are often considered sequential and is much easier to program than conventional multicores. Swarm programs consist of tiny tasks, as small as tens of instructions each. Parallelism is implicit: all tasks follow a programmer-specified total or partial order, eliminating the correctness pitfalls of explicit synchronization (e.g., deadlocks and data races). To scale, Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism.
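The ordered-task model can be illustrated with a plain sequential sketch: the program is expressed only as timestamped tasks, and it is then the runtime's (in Swarm's case, the hardware's) job to execute them speculatively in parallel while committing them in timestamp order. The sketch below is not Swarm code; a Python priority queue stands in for Swarm's hardware task units, and the example algorithm is a Dijkstra-style single-source shortest paths, where a task's timestamp is the tentative distance at which it runs.

```python
# Ordered-task sketch (sequential stand-in for Swarm's execution model):
# tasks carry timestamps; correctness requires only that tasks appear to
# run in timestamp order, leaving the hardware free to speculate ahead.
import heapq

graph = {0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}  # node -> [(nbr, weight)]
dist = {n: float("inf") for n in graph}
tasks = [(0, 0)]                  # (timestamp, node): visit source at time 0

while tasks:
    ts, u = heapq.heappop(tasks)  # earliest-timestamp task commits first
    if ts >= dist[u]:
        continue                  # a lower-timestamp task already settled u
    dist[u] = ts
    for v, w in graph[u]:
        # Spawn child tasks; children always have later timestamps.
        heapq.heappush(tasks, (ts + w, v))

print(dist)  # {0: 0, 1: 2, 2: 3}
```

Note there is no lock, barrier, or atomic in the program itself; the timestamp order encodes all the synchronization, which is what makes such programs nearly as simple as their sequential counterparts.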
Swarm builds on decades of work on speculative architectures and contributes new techniques to scale to large core counts, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered task commits. Swarm also incorporates new techniques to exploit locality and to harness nested parallelism, making parallel algorithms easy to compose and uncovering abundant parallelism in large applications.
Swarm accelerates challenging irregular applications from a broad set of domains, including graph analytics, machine learning, simulation, and databases. At 256 cores, Swarm is 53-561x faster than a single-core system, and outperforms state-of-the-art software-only parallel algorithms by one to two orders of magnitude. Besides achieving near-linear scalability, the resulting Swarm programs are almost as simple as their sequential counterparts, as they do not use explicit synchronization.
Biography: Daniel Sanchez is an Assistant Professor of Electrical Engineering and Computer Science at MIT. His research interests include parallel computer systems, scalable and efficient memory hierarchies, architectural support for parallelization, and architectures with quality-of-service guarantees. He earned a Ph.D. in Electrical Engineering from Stanford University in 2012 and received the NSF CAREER award in 2015.
Tony Nowatzki, University of California, Los Angeles
Abstract: Key players in industry have begun adopting application- and domain-specific accelerators because of their order-of-magnitude efficiency benefits, calling into question what role programmable, general-purpose architectures (e.g., SIMD, GPGPUs) will play in the era of acceleration. One possible answer is that new architectures and hardware/software interfaces are required to expose efficient hardware in a sufficiently flexible way.
This talk will discuss our work in this area, describe our proposed accelerator ISA and execution model, called stream-dataflow, and show how it enables architectures with efficiency close to that of custom accelerators in certain domains. While the proposed hardware abstractions are attractive and effective for structured workloads, they pose significant challenges for more irregular but very important workloads, including graph processing.
This talk will conclude with a set of challenges, some possible steps forward, and a discussion of what a truly general purpose accelerator might look like.
Biography: Tony Nowatzki is an assistant professor in the Department of Computer Science at the University of California, Los Angeles. His research interests include architecture and compiler co-design and mathematical modeling. He received a PhD in computer science from the University of Wisconsin-Madison. Contact him at firstname.lastname@example.org