Systems for Scalable Graph Analytics and Machine Learning: Trends and Methods
News
[08/29/2024] We present the tutorial at KDD'24, Barcelona!
[07/31/2024] Our slides are out. Please check here.
[07/16/2024] We will upload our slides soon. Please stay tuned.
Graph-theoretic algorithms and graph machine learning models are essential tools for addressing many real-life problems, such as social network analysis and bioinformatics. To support large-scale graph analytics, graph-parallel systems such as Google’s Pregel and Spark’s GraphX have been actively developed for over a decade; these systems (i) promote a think-like-a-vertex computing model and target (ii) iterative algorithms for (iii) problems that output a value for each vertex. However, this model is too restrictive to support the rich set of heterogeneous operations for graph analytics and machine learning that many real applications demand.
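To make the think-like-a-vertex model concrete, the following is a minimal Python sketch of a Pregel-style synchronous engine running PageRank: in each superstep, every vertex sends messages (rank shares) to its neighbors and updates its own value from the messages it received. The function names and structure are illustrative assumptions, not the actual API of Pregel or GraphX.

```python
# Minimal sketch of a think-like-a-vertex (Pregel-style) engine with a
# PageRank compute step. Names are illustrative, not any system's API.

def run_supersteps(graph, num_vertices, num_steps=10, damping=0.85):
    """graph: dict mapping each vertex to its list of out-neighbors."""
    rank = {v: 1.0 / num_vertices for v in graph}
    for _ in range(num_steps):
        # Message passing: each vertex sends its rank share along out-edges.
        inbox = {v: [] for v in graph}
        for v, neighbors in graph.items():
            if neighbors:
                share = rank[v] / len(neighbors)
                for u in neighbors:
                    inbox[u].append(share)
        # Vertex update: each vertex recomputes its value from its messages.
        for v in graph:
            rank[v] = (1 - damping) / num_vertices + damping * sum(inbox[v])
    return rank

# A tiny 3-vertex directed graph: 0 -> {1, 2}, 1 -> 2, 2 -> 0.
g = {0: [1, 2], 1: [2], 2: [0]}
ranks = run_supersteps(g, num_vertices=3)
```

Note how the per-vertex update function is the only application-specific logic; the engine handles message routing and superstep synchronization, which is exactly why this model fits iterative, one-value-per-vertex algorithms but not subgraph search.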
In recent years, two new trends have emerged in graph-parallel systems research: (1) a novel think-like-a-task computing model that can efficiently support various computationally expensive subgraph-search problems; and (2) scalable systems for training graph neural networks. These systems complement the existing landscape of graph-parallel tools, which can flexibly work together in a comprehensive graph processing pipeline for real applications, with the capability of capturing structural features. This tutorial categorizes the recent systems in these two directions based on their computing models and adopted techniques, and reviews their key design ideas.
This figure summarizes a typical graph processing pipeline, consisting of a graph analytics phase and an optional graph machine learning (ML) phase. The analytics tasks concern either individual vertices (e.g., node scoring or classification), or substructures and even entire graphs (e.g., dense/frequent subgraph mining, graph classification). There are four analytics paths in the pipeline:
Vertex Analytics, which outputs a score for each vertex, useful for applications such as biomolecule prioritization in network biology, or object ranking in recommender systems.
Vertex Analytics + ML, where the analytics stage outputs vertex embeddings for downstream ML tasks. Vertex embeddings can be learned from the graph topology as in DeepWalk and node2vec, or the vertex features may come directly from the applications or be computed based on the graph topology (e.g., in- and out-degrees, clustering coefficient).
Structure Analytics, which outputs subgraph structures (patterns or instances), useful for finding functional groups in network biology, and community detection.
Structure Analytics + ML, where informative structures are extracted as features for graph classification/regression.
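The "Vertex Analytics + ML" path above mentions vertex features computed from the graph topology, such as degrees and clustering coefficients. As a sketch under the assumption of an undirected graph stored as a dict of neighbor sets (the function and variable names are hypothetical), such features can be computed as:

```python
# Sketch of topology-based vertex features mentioned above: degree and
# local clustering coefficient. Illustrative only; adjacency is an
# undirected graph given as a dict mapping vertex -> set of neighbors.

def vertex_features(adj):
    feats = {}
    for v, nbrs in adj.items():
        deg = len(nbrs)
        # Count edges among v's neighbors (each closes a triangle with v).
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        # Fraction of possible neighbor pairs that are actually connected.
        cc = 2.0 * links / (deg * (deg - 1)) if deg > 1 else 0.0
        feats[v] = (deg, cc)
    return feats

# Triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
feats = vertex_features(adj)
```

Feature vectors like these can be fed directly to a downstream ML model, complementing learned embeddings such as those from DeepWalk or node2vec.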
In recent years, two new trends have emerged in graph-parallel systems research: (1) many novel systems have been developed to target the more compute-intensive subgraph-search problems, all adopting a subgraph-centric (rather than vertex-centric) programming model; (2) graph neural networks (GNNs) have boomed in various applications, and a number of scalable GNN systems have been developed. These two directions cover the "Graph Structures" and "ML" components of the pipeline well, but there is currently no comprehensive tutorial surveying these exciting new system advancements.
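The subgraph-centric, think-like-a-task idea can be sketched as follows: a subgraph-search job is decomposed into independent tasks, each mining the local subgraph around one vertex, so tasks can run in parallel and be load-balanced. Here the search problem is triangle counting; this is a minimal illustrative sketch, not the API of G-thinker or any particular system.

```python
# Sketch of the think-like-a-task model: decompose a subgraph-search job
# into independent per-vertex tasks executed by a worker pool.
# Illustrative only; not any real system's API.
from concurrent.futures import ThreadPoolExecutor

def make_task(v, adj):
    """Build a task that counts triangles whose smallest vertex is v."""
    def task():
        higher = [u for u in adj[v] if u > v]  # avoid double counting
        return sum(1 for i, u in enumerate(higher)
                   for w in higher[i + 1:] if w in adj[u])
    return task

def run_tasks(adj, workers=4):
    tasks = [make_task(v, adj) for v in adj]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda t: t(), tasks))

# Undirected graph with two triangles: 0-1-2 and 1-2-3.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
n_triangles = run_tasks(adj)
```

Unlike the vertex-centric model, each task here explores a whole subgraph (a vertex's higher-ordered neighborhood) rather than exchanging per-vertex messages, which is what makes the model a better fit for compute-intensive subgraph search.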
This tutorial aims to fill this gap. We offered a tutorial on graph-parallel systems at SIGMOD 2016, but it mainly focused on think-like-a-vertex (TLAV) systems. There have also been recent tutorials on GNNs, but they focused on model design in real applications. In contrast, this tutorial introduces the parallel/distributed system design aspects of subgraph search and GNNs, and is thus unique and timely.
The two topics we cover here are related, and both are important for fully exploring the potential of various graph analytics tools in a real application pipeline. For example, frequent subgraph structural patterns have been found informative in conventional models for graph classification and regression. ML applications benefiting from structural features include biochemistry, bioinformatics, and community detection, where structural features have been found to outperform neural graph embedding methods. There are also works applying GNNs to approximate subgraph search, such as neural subgraph matching and neural subgraph counting, where considering subgraph structures was found essential for good performance. Finally, subgraph GNNs, which model graphs as collections of subgraphs, have been found to be more expressive than regular GNNs.
This tutorial will be delivered by Dr. Da Yan from the Department of Computer Science at Indiana University Bloomington, and his graph system team members including his PhD students Lyuheng Yuan and Saugat Adhikari, and his Assistant Scientist Akhlaque Ahmad.
Dr. Yan and his team have developed graph-analytics systems including G-thinker, PrefixFPM, T-FSM, G2-AIMD and T-DFS, and are the pioneers of the think-like-a-task computing model T-thinker. Dr. Yan also has extensive experience in developing think-like-a-vertex (TLAV) graph systems and GNNs. Dr. Yan and his team have a long history of developing graph-parallel systems, with dozens of related publications in top conferences such as SIGMOD, VLDB, KDD and ICDE, and top journals such as ACM TODS, VLDB Journal, IEEE TPDS and IEEE TKDE. Dr. Yan led the development of the BigGraph@CUHK platform hosting many well-known TLAV systems, delivered a tutorial on TLAV systems at SIGMOD 2016, and has published books on graph-parallel systems with prestigious publishers. Dr. Yan is the sole winner of the Hong Kong 2015 Young Scientist Award in Physical/Mathematical Science, and his graph systems research was funded by the DOE Office of Science Early Career Research Program in 2023.
All presenters are from the Department of Computer Science, Indiana University Bloomington, IN, USA.
Da Yan, yanda@iu.edu
Lyuheng Yuan, lyyuan@iu.edu
Akhlaque Ahmad, akahmad@iu.edu
Saugat Adhikari, adhiksa@iu.edu