2020 ScalingGraphNeuralNetworkswithApproximatePageRank


Subject Headings: Graph neural networks (GNNs); PPRGo Model; Personalized PageRank (PPR).

Notes

Cited By

Quotes

Abstract

Graph neural networks (GNNs) have emerged as a powerful approach for solving many network mining tasks. However, learning on large graphs remains a challenge - many recently proposed scalable GNN approaches rely on an expensive message-passing procedure to propagate information through the graph. We present the PPRGo model which utilizes an efficient approximation of information diffusion in GNNs resulting in significant speed gains while maintaining state-of-the-art prediction performance. In addition to being faster, PPRGo is inherently scalable, and can be trivially parallelized for large datasets like those found in industry settings. We demonstrate that PPRGo outperforms baselines in both distributed and single-machine training environments on a number of commonly used academic graphs. To better analyze the scalability of large-scale graph learning methods, we introduce a novel benchmark graph with 12.4 million nodes, 173 million edges, and 2.8 million node features. We show that training PPRGo from scratch and predicting labels for all nodes in this graph takes under 2 minutes on a single machine, far outpacing other baselines on the same graph. We discuss the practical application of PPRGo to solve large-scale node classification problems at Google.

1 Introduction

Graph Neural Networks (GNNs) excel on a wide variety of network mining tasks from semi-supervised node classification and link prediction [25, 32, 44, 55] to community detection and graph classification [3, 15, 22, 37]. The success of GNNs on academic datasets has generated significant interest in scaling these methods to larger graphs for use in real-world problems [13, 14, 16, 21, 25, 27, 41, 54]. Unfortunately, there are few large graph baseline datasets available; apart from a handful of exceptions [16, 54], the scalability of most GNN methods has been demonstrated on graphs with fewer than 250K nodes. Moreover, the majority of existing work focuses on improving scalability on a single machine. Many interesting network mining problems involve graphs with billions of nodes and edges that require distributed computation across many machines. As a result, we believe most of the current literature does not accurately reflect the major challenges of large-scale GNN computing.

The main scalability bottleneck of most GNNs stems from the recursive message-passing procedure that propagates information through the graph. Computing the hidden representation for a given node requires joining information from its neighbors, and the neighbors in turn have to consider their own neighbors, and so on. This process leads to an expensive neighborhood expansion, growing exponentially with each additional layer.
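To make the cost concrete, here is a minimal sketch (illustrative Python with a toy adjacency list, not the paper's code) of the receptive field an L-layer message-passing GNN has to gather for a single node; with average degree d it grows roughly like d^L, which is the exponential neighborhood expansion described above.

```python
from collections import deque

def receptive_field(adj, source, num_layers):
    """All nodes a `num_layers`-layer message-passing GNN must touch to
    compute the representation of `source`, i.e. its L-hop neighborhood."""
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_layers:
            continue
        for neighbor in adj[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited

# Toy graph: the receptive field (and hence the IO cost) grows with every layer.
adj = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1], 5: [2]}
for layers in range(1, 4):
    print(layers, len(receptive_field(adj, source=0, num_layers=layers)))
```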

In many proposed GNN pipelines, the exponential growth of neighborhood size corresponds to an exponential I/O overhead. A common strategy for scaling GNNs is to sample the graph structure during training, e.g. sample a fixed number of nodes from the k-hop neighborhood of a given node to generate its prediction [25, 54]. The key difference between many scalable techniques lies in the design of the sampling scheme. For example, Chen et al. [13] directly sample the receptive field for each layer using importance sampling, while Chen et al. [14] use the historical activations of the nodes as a control variate. Huang et al. [27] propose an adaptive sampling strategy with a trainable sampler per layer, and Chiang et al. [16] sample a block of nodes corresponding to a dense subgraph identified by the clustering algorithm METIS [30]. Because these approaches still rely on a multi-hop message-passing procedure, there is an extremely steep trade-off between runtime and accuracy. Unfortunately, for many of the proposed methods sampling does not directly reduce the number of nodes that need to be retrieved, since e.g. we first have to compute the importance scores [13].

Recent work shows that personalized PageRank (Jeh & Widom, 2003) can be used to directly incorporate multi-hop neighborhood information of a node without explicit message passing [33]. Intuitively, propagation based on personalized PageRank corresponds to infinitely many neighborhood aggregation layers where the node influence decays exponentially with each layer. However, as proposed, the approach of Klicpera et al. [33] does not easily scale to large graphs, since it performs an expensive variant of power iteration during training.
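For reference, a minimal sketch of this kind of power-iteration propagation (in the spirit of [33]; `adj_norm`, `local_logits`, and the defaults are illustrative assumptions, not the paper's API). Note that every training step has to multiply the full normalized adjacency matrix several times, which is exactly the cost that limits scalability.

```python
def appnp_propagate(adj_norm, local_logits, alpha=0.15, num_iterations=10):
    """Power-iteration style diffusion: Z <- (1 - alpha) * A_hat @ Z + alpha * H.
    Its fixed point is the personalized-PageRank diffusion of H = f_theta(X).

    adj_norm     : normalized adjacency matrix (e.g. a scipy.sparse matrix)
    local_logits : per-node predictions H before diffusion (NumPy array)
    """
    Z = local_logits
    for _ in range(num_iterations):
        # touches every edge of the graph in every iteration of every training step
        Z = (1.0 - alpha) * (adj_norm @ Z) + alpha * local_logits
    return Z
```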

In this work, we present PPRGo, a GNN model that scales to large graphs in both single and multi-machine (distributed) environments by using an adapted propagation scheme based on approximate personalized PageRank. Our approach removes the need for performing expensive power iteration during each training step by utilizing the (strong) localization properties [23, 36] of personalized PageRank vectors for real-world graphs. These vectors can be readily approximated with sparse vectors and efficiently pre-computed in a distributed manner [5]. Using the sparse pre-computed approximations we can maintain the influence of relevant nodes located multiple hops away without prohibitive message-passing or power iteration costs. We make the following contributions:

  • We introduce the PPRGo model based on approximate personalized PageRank. On a graph of over 12 million nodes, PPRGo runs in under 2 minutes on a single machine, including pre-processing, training and inference time.
  • We show that PPRGo scales better than message-passing GNNs, especially with distributed training in a real-world setting.
  • We introduce the MAG-Scholar dataset (12.4M nodes, 173M edges, 2.8M node features), a version of the Microsoft Academic Graph that we augment with "ground-truth" node labels. The dataset is orders of magnitude larger than many commonly used benchmark graphs.

  • Most previous work exclusively focuses on training time. We also show a significantly reduced inference time and furthermore propose sparse inference to achieve an additional 2x speed-up.
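As a rough illustration of the propagation scheme described above (a sketch, not the authors' implementation), a PPRGo-style prediction can be assembled from precomputed, truncated PPR vectors; `topk_indices`, `topk_weights`, and `local_model` are assumed, illustrative names.

```python
import numpy as np

def pprgo_predict(topk_indices, topk_weights, features, local_model):
    """PPRGo-style prediction sketch: a node's output is a PPR-weighted sum of
    the *local* predictions of its top-k personalized-PageRank neighbors, so no
    recursive message passing is needed during training or inference.

    topk_indices : (num_nodes, k) int array, neighbors with the largest PPR scores
    topk_weights : (num_nodes, k) float array, the corresponding PPR scores
    features     : (num_nodes, d) feature matrix
    local_model  : callable mapping features to per-node logits, e.g. an MLP
    """
    neighbor_logits = local_model(features[topk_indices.reshape(-1)])
    neighbor_logits = neighbor_logits.reshape(*topk_indices.shape, -1)
    # weight each neighbor's logits by its (approximate) PPR score and sum
    return (topk_weights[..., None] * neighbor_logits).sum(axis=1)

# Toy check: 4 nodes, top-2 neighbors each, identity "local model".
idx = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
weights = np.full((4, 2), 0.5)
print(pprgo_predict(idx, weights, np.eye(4), lambda h: h))
```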

2 BACKGROUND

2.1 GNNs and Message-Passing

2.3 Related work

Scalability

GNNs were first proposed in Gori et al. [24] and Scarselli et al. [42] and have since emerged as a powerful approach for solving many network mining tasks [1, 2, 9, 17, 22, 32, 42, 44]. Most GNNs do not scale to large graphs since they typically need to perform a recursive neighborhood expansion to compute the hidden representations of a given node. While several approaches have been proposed to improve the efficiency of graph neural networks [13, 14, 16, 21, 25, 27, 41, 49, 54], the scalability of GNNs to massive (web-scale) graphs is still under-studied. As we discussed in § 1, the most prevalent approach to scalability is to sample a subset of the graph, e.g. based on different importance scores for the nodes [16, 21, 25, 41, 54]. Beyond sampling, Gao et al. [21] collect the representations from a node's neighborhood into a matrix, sort them independently along each column / feature, and use the k largest entries as input to a 1-dimensional CNN. These techniques all focus on single-machine environments with limited (GPU) memory.
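A minimal NumPy sketch of that sorting-based aggregation (an illustration of the idea attributed to Gao et al. [21], not their implementation; `adj`, `features`, and the zero-padding are assumptions):

```python
import numpy as np

def sorted_neighborhood_features(adj, features, node, k):
    """Stack the neighbors' feature vectors, sort each column independently,
    and keep the k largest entries per feature, giving a fixed-size (k x d)
    input suitable for a 1-dimensional CNN."""
    neighbor_feats = features[adj[node]]            # (num_neighbors, d)
    sorted_cols = np.sort(neighbor_feats, axis=0)   # ascending, per column
    top_k = sorted_cols[-k:][::-1]                  # k largest per column
    if top_k.shape[0] < k:                          # zero-pad low-degree nodes
        pad = np.zeros((k - top_k.shape[0], features.shape[1]))
        top_k = np.vstack([top_k, pad])
    return top_k
```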

Buchnik and Cohen [10] propose feature propagation, which can be viewed as a simplified linearized GNN. They perform graph-based smoothing as a preprocessing step (before learning) to obtain diffused node features, which are then used to train a logistic regression classifier to predict the node labels. Wu et al. [49] propose an equivalent simple graph convolution (SGC) model and diffuse the features by multiplication with the k-th power of the normalized adjacency matrix. However, node features are often high-dimensional, which can make the preprocessing step computationally expensive. More importantly, while node features are typically sparse, the obtained diffused features become denser, which significantly reduces the efficiency of the subsequent learning step. Both of these approaches are a special case of the PPNP model [33], which experimentally shows higher classification performance [18, 33].
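For concreteness, a minimal sketch of this kind of feature pre-propagation (SGC-style diffusion as described for [49]; `adj` and `features` are assumed SciPy/NumPy inputs, and the downstream classifier is only indicated in a comment):

```python
import numpy as np
import scipy.sparse as sp

def sgc_features(adj, features, k=2):
    """Diffuse features with the k-th power of the normalized adjacency:
    X_diffused = S^k X, with S = D^{-1/2} (A + I) D^{-1/2}. Even if X is
    sparse, S^k X generally becomes dense, which is the efficiency issue
    noted above."""
    n = adj.shape[0]
    a_tilde = adj + sp.eye(n)
    deg = np.asarray(a_tilde.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    s = d_inv_sqrt @ a_tilde @ d_inv_sqrt
    x = features
    for _ in range(k):
        x = s @ x
    return x

# Downstream step (only indicated): train a plain logistic regression,
# e.g. sklearn's LogisticRegression, on sgc_features(adj, X, k)[train_idx].
```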

Approximating PageRank.

Recent approaches combine basic techniques to create algorithms with enhanced guarantees [35, 45, 46]. For example, Wei et al. [48] propose the TopPPR algorithm, which combines the strengths of random walks and forward / backward search. They can compute the top-k entries of a personalized PageRank vector up to a given precision using a filter-and-refine paradigm. Another family of approaches [20] is based on the idea of maintaining upper and lower bounds on the PageRank scores, which are then used for early termination with certain guarantees. For our purpose the basic techniques are sufficient.
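One such basic technique, sketched here under simplifying assumptions (unweighted graph given as an adjacency-list dict, crude handling of dangling nodes), is the local forward-push approximation in the spirit of Andersen et al. [5]; a PPRGo-style pipeline would keep only the top-k entries of the returned sparse vector.

```python
def approx_ppr_push(adj, source, alpha=0.15, eps=1e-4):
    """Local push approximation of the personalized PageRank vector pi(source).
    Only nodes whose residual exceeds eps times their degree are ever processed,
    so the touched neighborhood, and hence the result, stays sparse."""
    estimate = {}               # sparse estimate of pi(source)
    residual = {source: 1.0}    # probability mass still to be pushed
    queue = [source]
    while queue:
        u = queue.pop()
        mass = residual.pop(u, 0.0)
        if not adj[u]:          # dangling node: keep its mass (simplification)
            estimate[u] = estimate.get(u, 0.0) + mass
            continue
        estimate[u] = estimate.get(u, 0.0) + alpha * mass
        spread = (1.0 - alpha) * mass / len(adj[u])
        for v in adj[u]:
            residual[v] = residual.get(v, 0.0) + spread
            if residual[v] >= eps * max(len(adj[v]), 1) and v not in queue:
                queue.append(v)
    return estimate

# Toy usage on a small ring graph: the returned dict is the sparse PPR vector.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(approx_ppr_push(ring, source=0))
```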

References

Aleksandar Bojchevski, Johannes Klicpera, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózemberczki, Michal Lukasik, and Stephan Günnemann (2020). "Scaling Graph Neural Networks with Approximate PageRank." In: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD 2020).