Graph Neural Networks

Neural networks (NNs) are information processing architectures that take a given input and process it through a cascade of layers, each of which applies a linear transform followed by an activation function (typically, a pointwise nonlinearity). These architectures are used to model mappings between a given input and a desired target output. They offer a nonlinear mapping that still allows for efficient training: given a training set, the optimal linear transforms are determined by minimizing some cost function over this training set. A neural network has learned when, after training, it generalizes well to unseen samples (i.e., given a sample that was not in the training set, the output of the neural network still performs well according to some evaluation measure).

We note that the linear transforms, which contain the parameters that we learn (i.e., the elements of the linear transform are obtained by optimization over the training set), depend on the size of the input data. Thus, neural networks do not scale: large input data gives rise to several issues, such as the curse of dimensionality, the need for a very large training set, and an excessive computational cost. The way to overcome this issue is to regularize the linear transform so as to reduce the search space. This regularization is most effective when it exploits the structure of the data (or any external information we may have on the input). Particularly successful have been convolutional neural networks (CNNs), which operate on time signals and images, and which regularize the linear transform by forcing it to be a bank of small-support filters. The output of each filter can be efficiently computed by means of the convolution operation (reducing computational cost), the small support of the filters controls the number of learnable parameters (making it independent of the size of the input data), and the fact that convolutions are translation equivariant exploits the regular structure of the data (we can argue that this works as a data augmentation technique, artificially increasing the given dataset). CNNs have shown remarkable success in many classification and regression tasks involving data in the format of time signals and images.

Fig. Examples of network data that exhibit an irregular structure described by a graph. (i) Team of robots. There is a communication network, and the robots talk to each other to coordinate on solving some task. (ii) Smart grids. Given the power network, the objective is to use the state of the grid to decide on power generation schemes. (iii) Remote sensing. There is a set of satellites measuring different aspects of the same phenomenon, and these measurements are related to each other depending on the relative position of the satellites. (iv) Smart cities. The objective is for cars to coordinate among themselves to improve traffic. (Images taken from Google Image Search, credits to the corresponding authors.)

We are interested, however, in network data, which can rarely be described in terms of a regular structure such as time or images. As such, regular CNNs no longer work, since they fail to exploit the irregular structure present in network data. In what follows we discuss the concept of graph neural networks (GNNs), which are neural networks that regularize the linear transform by exploiting the underlying graph structure that describes network data.

L. Ruiz, F. Gama, and A. Ribeiro, "Graph Neural Networks: Architectures, Stability and Transferability," Proc. IEEE, vol. 109, no. 5, pp. 660–682, May 2021.

F. Gama, E. Isufi, G. Leus, and A. Ribeiro, "Graphs, Convolutions, and Neural Networks: From Graph Filters to Graph Neural Networks," IEEE Signal Process. Mag., vol. 37, no. 6, pp. 128-138, Nov. 2020.

Graph Convolutional Neural Networks

A way to regularize a neural network to take into account the underlying graph structure is to use a bank of graph filters. In particular, linear shift-invariant graph filters define a graph convolution, analogously to how a linear time-invariant filter defines a regular convolution. More precisely, while a regular convolution is a weighted average of (time- or space-)shifted versions of the input signal, the graph convolution is defined as a weighted average of graph-shifted versions of the input graph signal. Additionally, graph convolutions satisfy the convolution theorem, by which a convolution in the graph domain can be computed as an elementwise multiplication in the frequency domain. A graph convolution can be viewed as a weighting of the information contained in increasingly distant neighborhoods. As such, each filter tap in the convolution weights a summary of the information in each neighborhood independently. These filter taps are the parameters that are learned from the data, so that the graph convolutional neural network (GCNN) essentially learns how to correctly harvest information from the neighbors. The number of filter taps can be defined independently of the size of the graph. Moreover, the graph convolution exploits the underlying graph structure, adequately regularizing the linear transform, and can be computed efficiently by means of local exchanges of information only.

Fig. Graph convolution. We denote as x the input data, as S the graph structure (graph shift operator, i.e., the adjacency matrix, the graph Laplacian, or any other matrix that respects the sparsity of the graph), and as h_k the corresponding filter taps. It is key to observe that the data x is a signal supported on top of a graph S (i.e., the graph signal x is shown as the small dotted bar on top of each node, and the nodes and edges represent the supporting graph). We note that, since S is a matrix that respects the sparsity of the graph (i.e., there is a nonzero element only where there is an edge in the corresponding graph), the operation Sx involves only exchanges within the one-hop neighborhood, and computes a weighted average of the values of the graph signal x at the neighboring nodes. The figure shows how repeated application of S entails repeated exchanges of information with the neighbors, bringing summaries from increasingly distant neighborhoods. Ultimately, the graph convolution computes a weighted average of the information located in each of these neighborhoods.
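To make the operation concrete, the following minimal NumPy sketch (illustrative only; function and variable names are ours) computes the graph convolution y = h_0 x + h_1 Sx + ... + h_K S^K x described above, using only repeated applications of the shift operator S:

```python
import numpy as np

def graph_convolution(S, x, h):
    """Graph convolution: weighted average of graph-shifted signals,
    y = h_0 x + h_1 S x + h_2 S^2 x + ... + h_K S^K x."""
    y = np.zeros_like(x, dtype=float)
    z = x.astype(float)          # z = S^k x, starting at k = 0
    for h_k in h:
        y += h_k * z             # weight the k-hop neighborhood summary
        z = S @ z                # one more local exchange: S^(k+1) x
    return y

# Example: 4-node path graph (adjacency matrix as shift operator)
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])  # graph signal: impulse at node 0
h = [1.0, 0.5, 0.25]                 # filter taps
y = graph_convolution(S, x, h)
```

Note that each multiplication by S is a purely local operation: node i only needs the signal values of its one-hop neighbors.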

The general idea of using a bank of graph filters is to learn how to extract useful features that describe the information contained in the input data, tailored to the task at hand. Typically, we learn an ever-growing number of features as the GNN becomes deeper (more layers). This implies that the dimension of the input at each layer (given by the number of nodes and the number of features) keeps increasing. To control this dimension, we can reduce the number of nodes as we increase the number of features, trading off spatial information for features. This is achieved by the operation of pooling, which essentially builds regional summaries of information that are kept in exchange for more detailed local information. One way of building these summaries is graph coarsening, a technique that builds sequentially smaller graphs by mapping pairs of nodes in the bigger graph into a single node in the smaller graph and recomputing the connections among these new nodes.

However, the given graph oftentimes represents a physical network (sensor networks, power grids, teams of autonomous agents) and, as such, cannot be arbitrarily changed into a smaller graph (i.e., the given nodes and the connections among them are fixed). In this case, a technique such as graph coarsening cannot be applied, since the computations happening at deeper layers (on smaller graphs) cannot actually be implemented by means of local exchanges between the nodes. In order to keep the original graph but reduce the number of computations, we can think of the pooling operation as a summarizing function that gathers information from the neighborhood, followed by a downsampling operation that selects some of the nodes as representatives of the given region. Then, after only some of the nodes have been selected, we can use zero-padding to compute graph convolutions at different layers, always over the same given graph. That is, the nodes that were not selected are set to zero, acting only as relay stations, while the features are updated at the subset of nodes that was selected. Only the values at these nodes are kept as the output of each graph convolution, ignoring the nodes that were previously set to zero. This architecture, dubbed the selection GNN, implements a deep graph convolutional neural network in an entirely local fashion (using only, possibly repeated, one-hop communication with the neighbors), exploiting the graph structure of the data while simultaneously reducing the computational complexity by operating on fewer nodes. Moreover, since the same graph is used throughout the layers, all the outputs can be interpreted in the same graph frequency domain.
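A minimal sketch of the pooling-with-zero-padding idea (illustrative only; the summarizing function here is a neighborhood maximum, one common choice, and the node selection is given externally):

```python
import numpy as np

def select_and_pad(S, x, selected, n_hops=1):
    """Pooling for the selection GNN (illustrative sketch): each selected
    node keeps a summary (here, the maximum) of the signal over its
    n-hop neighborhood, while all other nodes are zeroed out, so the
    next layer's graph convolution still runs on the original graph."""
    N = S.shape[0]
    hop = (S != 0).astype(int)
    reach = np.eye(N, dtype=int)        # n-hop reachability, node included
    for _ in range(n_hops):
        reach = ((reach + reach @ hop) > 0).astype(int)
    x_pooled = np.zeros(N)              # zero-padding: unselected nodes relay zeros
    for i in selected:
        x_pooled[i] = x[reach[i] > 0].max()   # neighborhood summary at node i
    return x_pooled

# Example: 4-node path graph, keep nodes 0 and 2 as regional representatives
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x_pooled = select_and_pad(S, np.array([1.0, 2.0, 3.0, 4.0]), selected=[0, 2])
```

The output lives on the original graph, with zeros at the unselected nodes, so the next graph convolution can be computed by the same local exchanges.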

Fig. Selection GNN. Each row represents a (deeper) layer, while the first column represents the input to the layer, the second column the graph convolution operation, and the third column the pooling and zero-padding operation. The nodes that have been greyed-out represent the nodes that have been set to zero by the downsampling operation. The size of the disks is a representative of the area of influence of either the convolution (second column) or the summarizing function in the pooling stage (third column).

The selection GNN interprets the graph convolution as a diffusion operation, whereby the information at each node gets diffused through the graph by repeated communications with one-hop neighbors, and each step of this diffusion is weighted differently by a filter tap. Alternatively, we can think of the graph convolution as an aggregation process. Consider a single node, and start with the value of the signal at that node, saving it as the first element of the aggregation sequence. Then, exchange information with the neighbors and compute the weighted average of their values, saving that value as the second element of the aggregation sequence. Third, exchange information with the neighbors again, and save the result as the third element of the sequence, and so on. Each element of the aggregation sequence represents a summary of the information in an increasingly distant neighborhood. More importantly, we note that this aggregation sequence has a regular structure: contiguous elements of the sequence represent information from contiguous neighborhoods. Therefore, the aggregation sequence is a regular-structured vector that, simultaneously, takes into account the underlying irregular graph support. As such, we can operate on the aggregation sequence with any regular operation, in particular a regular convolution and a regular CNN. In summary, the aggregation GNN consists of building the aggregation sequence by sequentially obtaining neighborhood summaries, and then feeding the resulting regular-structured vector into a CNN. This architecture is also entirely local (in fact, it can be computed at a single node alone, if enough exchanges are performed), and relies on the computational power of each node to run the corresponding CNN and learn the useful features.

Fig. Aggregation GNN. We build the aggregation sequence by repeatedly exchanging information with the neighbors, building summaries from further away neighborhoods. The resulting sequence is a regular-structured signal, where contiguous elements represent the information from nearby neighborhoods. Then, this regular-structured sequence can be processed by regular operations, in particular, a CNN, to learn the features describing the graph signal, while exploiting the underlying graph structure as well.
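The aggregation process above can be sketched as follows (a minimal illustration under our notation; the 1-D filter [0.5, 0.5] stands in for a learned CNN layer):

```python
import numpy as np

def aggregation_sequence(S, x, node, K):
    """Aggregation sequence at a single node: [x_i, (Sx)_i, (S^2 x)_i, ...].
    Contiguous entries summarize contiguous neighborhoods, so the
    resulting vector has a regular structure."""
    seq, z = [], x.astype(float)
    for _ in range(K):
        seq.append(z[node])   # value of the k-hop summary at this node
        z = S @ z             # one more exchange with the neighbors
    return np.array(seq)

# The regular-structured sequence can then be processed with regular
# operations, e.g. a regular 1-D convolution (a full CNN, in general).
S = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0])
seq = aggregation_sequence(S, x, node=1, K=4)
features = np.convolve(seq, [0.5, 0.5], mode='valid')  # regular convolution
```

Note that the whole sequence can be gathered at a single node, since each entry only requires one more exchange with the one-hop neighbors.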

F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Convolutional Neural Networks Architectures for Signals Supported on Graphs," IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034-1049, 15 Feb. 2019.

F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Convolutional Graph Neural Networks," in 53rd Asilomar Conf. Signals, Systems and Comput. Pacific Grove, CA: IEEE, 3-6 Nov. 2019, pp. 452-456.

F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Aggregation Graph Neural Networks," in 44th IEEE Int. Conf. Acoust., Speech and Signal Process. Brighton, UK: IEEE, 12-17 May 2019, pp. 4943-4947.

F. Gama, A. G. Marques, A. Ribeiro, and G. Leus, "MIMO Graph Filters for Convolutional Networks," in 19th IEEE Int. Workshop Signal Process. Advances in Wireless Commun. Kalamata, Greece: IEEE, 25-28 June 2018, pp. 1-5.

Stability Properties

In many practical situations, the underlying graph support may be unknown, and therefore has to be estimated, or it may change (slightly) with time, as in teams of autonomous agents, whose communication network changes as the agents move. In such cases, it is of paramount importance to determine how robust the GNN is to changes in the underlying topology. In this project, we study the stability of a GNN to changes in its underlying topology. In other words, given an input and a set of filter coefficients for all the filters in the GNN, how much does the output change when the GNN operates on one graph versus another? We note that the results derived in these papers also have a big impact on transfer learning, since training on one graph and testing on another, while still achieving good performance, is equivalent to determining how stable the GNN is to changes in the graph.

First, we consider changes in the graph that are permutations. That is, we study the output of the GNN on two graphs, one of which is a permutation (node reordering) of the other. We prove that the output of the GNN on the permuted graph is a permutation of the output on the original graph. The immediate consequence is that the GNN is independent of the node ordering, which is good news, since the node ordering is arbitrary. But more importantly, this result shows that GNNs exploit the topological symmetries of the underlying graph support. That is, if there are parts of the graph that have the same network topology, then the output on one part will be the same as on the other (whenever the input signal is the same in both places). This acts as a data augmentation technique, since learning how to process data in one part of the graph lets the GNN know how to process that data in any other part of the graph that exhibits the same topological symmetries. This property is analogous to the translational equivariance of regular CNNs.

Fig. Example of permutation equivariance. Consider the graph on the left as the given graph. The labeling is determined by the numbers inside each node, and the corresponding graph signal is given by the colors on the nodes. The first and second graphs look exactly the same except for the node labeling. Therefore, we would expect the output of filtering on both graphs to be exactly the same, respecting the corresponding node order; this holds by the proof of permutation equivariance. More important, though, is the third graph, on the right. This graph has the same ordering as the given (leftmost) graph, so it is the same graph support; however, the signal on top of it is different. We thus have two different signals on the same graph. Observe now that if we take the rightmost graph, turn it inside out, and rotate it to the left, we end up with a graph equivalent to the second (middle) graph, which is, in turn, the same as the first one. Therefore, the output of a filter on the first graph and on the third graph is also the same, even though the signal is different. This is due to the topological symmetries of the graph. In essence, we can observe the signal on the first (leftmost) graph and learn how to process the signal on the third (rightmost) graph, even if we never observed it. Thus, permutation equivariance serves as data augmentation. We note that in this illustrative graph the symmetry is perfect; in real graphs we would not expect the symmetry to be perfect, but there could be parts of the graph where it does exist.
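Permutation equivariance is easy to verify numerically for a graph filter; the sketch below (our notation, random data) checks that filtering a permuted signal on the permuted graph equals permuting the filtered output:

```python
import numpy as np

def graph_filter(S, x, h):
    """Graph convolution y = sum_k h_k S^k x."""
    y, z = np.zeros_like(x, dtype=float), x.astype(float)
    for h_k in h:
        y += h_k * z
        z = S @ z
    return y

# Random symmetric shift operator, signal, and filter taps
rng = np.random.default_rng(0)
N = 6
S = rng.random((N, N)); S = (S + S.T) / 2
x = rng.random(N)
h = [1.0, -0.5, 0.2]

# A permutation matrix P (node relabeling)
perm = rng.permutation(N)
P = np.eye(N)[perm]

# Equivariance: filtering on the permuted graph (P S P^T) with the
# permuted signal (P x) equals permuting the original filter output.
lhs = graph_filter(P @ S @ P.T, P @ x, h)
rhs = P @ graph_filter(S, x, h)
print(np.allclose(lhs, rhs))  # True
```

The identity follows from (P S Pᵀ)^k (P x) = P S^k x, since Pᵀ P = I.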

Next, we consider more general graph perturbations, such as changes in the edge weights (for example, edge weights getting smaller as agents drift away and the respective communication links weaken). In light of the above observations, we measure the changes in the output of the GNN modulo permutations. We then prove that, if the graph filters composing the filter bank are Lipschitz continuous, the GNN is unstable to these perturbations. However, if these filters are integral Lipschitz (i.e., the integral of these filters is Lipschitz continuous), then they are stable, as we prove in this work. Integral Lipschitz filters force a flat response at high graph frequencies, which implies that these kinds of filters are unable to discriminate between pieces of information located at high frequencies. In essence, linear graph filters are either stable (integral Lipschitz filters) or selective (Lipschitz filters), but cannot be both. This is an inherent limitation of linear filtering. It can be overcome by using pointwise nonlinearities, which spread information throughout the frequency spectrum, in particular toward low graph frequencies, where it can then be discriminated in a stable fashion. And since a linear graph filter followed by a pointwise nonlinearity is none other than a GNN, we find that GNNs are both stable and selective information processing architectures, offering a considerable performance advantage over linear methods.

Fig. Consider a given set of filter taps. The graph Fourier transform (GFT) of the filter is a polynomial whose coefficients are given by these filter taps. Observe, then, that the GFT filter function is independent of the actual graph, since it only depends on the filter taps; it is given here by the solid black line. Once we are given a graph, we instantiate this GFT filter function on the eigenvalues of that graph. For example, for the given graph, the filter GFT coefficients are obtained by instantiating the GFT filter function on the blue eigenvalues, while the filter GFT coefficients for the perturbed graph are obtained by instantiating it on the red eigenvalues.
Fig. When we consider perturbations of the edge weights, we observe that the high-eigenvalue frequencies get perturbed proportionally to their value. This means that, even if the perturbation is small, the high-eigenvalue frequencies get perturbed a lot. Therefore, no matter how small the perturbation is, the output of a Lipschitz filter (left image) changes a lot, making it unstable. On the other hand, integral Lipschitz filters (right image) force a flat response at high-eigenvalue frequencies. This makes integral Lipschitz filters stable, since even a big change in the high-eigenvalue frequencies still yields essentially the same filter output.
Fig. Integral Lipschitz filters have a drawback, though. Consider two signals we want to discriminate: one has all its information located at the highest eigenvalue, the other at the second-to-last eigenvalue. To discriminate between them, we need to design very selective filters. However, if these filters are selective, then they will not be stable, since a small perturbation will throw the eigenvalues out of the passband. On the other hand, if we use integral Lipschitz filters, which we know are stable, then they cannot be selective, and will not be able to discriminate between these two signals with high-eigenvalue frequency content. Thus, integral Lipschitz filters are stable but not selective (while Lipschitz filters are selective but not stable).
Fig. When we apply a nonlinearity, we cause the energy to spill throughout the frequency spectrum. In particular, the spillage into low-eigenvalue frequencies can now be discriminated in a stable fashion with linear filters. But the combination of a nonlinearity with a graph filter is none other than a GNN. Therefore, GNNs are information processing architectures that are simultaneously stable and selective. We note that, since we cannot control the exact shape of the spillage into low-eigenvalue frequencies, we train the filter taps to learn how to interact with the specific chosen nonlinearity, extracting the useful features that the signal presents, irrespective of their frequency location.
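The stability trade-off can be illustrated numerically. In the sketch below, the responses λ and arctan(λ) are merely illustrative choices of a Lipschitz and an integral Lipschitz frequency-response profile, respectively; a small relative perturbation of the eigenvalues changes the Lipschitz response proportionally to the eigenvalue, while the integral Lipschitz response barely moves at high frequencies:

```python
import numpy as np

# Two illustrative frequency responses evaluated at the graph eigenvalues
lipschitz = lambda lam: lam                   # Lipschitz: selective, not flat
int_lipschitz = lambda lam: np.arctan(lam)    # integral Lipschitz: flat at high lam

lam = np.array([0.5, 5.0, 50.0])   # low, mid, and high graph frequencies
eps = 0.01                          # a small relative edge perturbation
lam_pert = (1 + eps) * lam          # eigenvalues move proportionally to their value

# Change in each filter's response caused by the perturbation
delta_lip = np.abs(lipschitz(lam_pert) - lipschitz(lam))          # grows with lam
delta_int = np.abs(int_lipschitz(lam_pert) - int_lipschitz(lam))  # stays small
```

Even with a 1% perturbation, the Lipschitz response at the highest frequency shifts by 0.5, while the integral Lipschitz response shifts by a tiny amount, illustrating why flat high-frequency responses buy stability at the cost of selectivity.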

F. Gama, J. Bruna, and A. Ribeiro, "Stability Properties of Graph Neural Networks," IEEE Trans. Signal Process., vol. 68, pp. 5680-5695, 25 Sep. 2020.

S. Pfrommer, F. Gama, and A. Ribeiro, "Discriminability of Single-Layer Graph Neural Networks," in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021, pp. 8508–8512.

F. Gama, J. Bruna, and A. Ribeiro, "Stability of Graph Neural Networks to Relative Perturbations," in 45th IEEE Int. Conf. Acoust., Speech and Signal Process. Barcelona, Spain: IEEE, 4-8 May 2020, pp. 9070-9074.

F. Gama, J. Bruna, and A. Ribeiro, "Stability of Graph Scattering Transforms," in 33rd Conf. Neural Inform. Process. Syst. Vancouver, BC: Neural Inform. Process. Syst. Foundation, 8-14 Dec. 2019, pp. 8038-8048.

F. Gama, A. Ribeiro, and J. Bruna, "Diffusion Scattering Transforms on Graphs," in 7th Int. Conf. Learning Representations, New Orleans, LA, 6-9 May 2019, pp. 1–12.

Graph Recurrent Neural Networks

Graph processes exhibit a temporal structure determined by the sequence index and a spatial structure determined by the graph support. To learn from graph processes, an information processing architecture must then be able to exploit both underlying structures. Graph Recurrent Neural Networks (GRNNs) act as a general learning framework that achieves this goal by leveraging the notion of a recurrent hidden state together with graph signal processing (GSP). In the GRNN, the number of learnable parameters is independent of the length of the sequence and of the size of the graph, guaranteeing scalability. GRNNs are permutation equivariant and stable to perturbations of the underlying graph support. To address the problem of vanishing gradients, gated GRNNs can be developed, with three different gating mechanisms: time, node, and edge gates. In numerical experiments involving both synthetic and real datasets, time-gated GRNNs are shown to improve upon GRNNs in problems with long-term dependencies, while node and edge gates help encode long-range dependencies present in the graph. The numerical results also show that GRNNs outperform GNNs and RNNs, highlighting the importance of taking both the temporal and graph structures of a graph process into account.
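One common way to write the GRNN recurrence (a sketch under our notational assumptions; the tanh nonlinearity and the single-feature case are illustrative simplifications) replaces the dense weight matrices of a regular RNN with graph convolutions A(S), B(S), C(S):

```python
import numpy as np

def graph_filter(S, x, h):
    """Graph convolution y = sum_k h_k S^k x."""
    y, z = np.zeros_like(x, dtype=float), x.astype(float)
    for h_k in h:
        y += h_k * z
        z = S @ z
    return y

def grnn(S, xs, a, b, c):
    """GRNN sketch: hidden state z_t = tanh(A(S) x_t + B(S) z_{t-1}),
    output y_t = C(S) z_t, where A, B, C are graph convolutions with
    taps a, b, c. The parameter count (len(a) + len(b) + len(c)) is
    independent of both the graph size and the sequence length."""
    z = np.zeros(S.shape[0])
    ys = []
    for x_t in xs:
        z = np.tanh(graph_filter(S, x_t, a) + graph_filter(S, z, b))
        ys.append(graph_filter(S, z, c))
    return ys

# Example: a short graph process on a 3-node path graph
S = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
xs = [np.ones(3), np.zeros(3), np.ones(3)]
ys = grnn(S, xs, a=[1.0, 0.5], b=[0.3], c=[1.0])
```

Because every operation is a graph convolution, the recurrence inherits the permutation equivariance and locality of the filters it is built from.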

L. Ruiz, F. Gama, and A. Ribeiro, "Gated Graph Recurrent Neural Networks," IEEE Trans. Signal Process., vol. 68, pp. 6303-6318, 26 Oct. 2020.

L. Ruiz, F. Gama, and A. Ribeiro, "Spatial Gating Strategies for Graph Recurrent Neural Networks," in 45th IEEE Int. Conf. Acoust., Speech and Signal Process. Barcelona, Spain: IEEE, 4-8 May 2020, pp. 5550–5554.

L. Ruiz, F. Gama, and A. Ribeiro, "Gated Graph Convolutional Recurrent Neural Networks," in 27th Eur. Signal Process. Conf. A Coruña, Spain: Eur. Assoc. Signal Process., 2-6 Sep. 2019, pp. 1–5, best student paper award.

Non-convolutional Neural Networks

The use of convolutional graph filters to regularize the linear transform in neural networks, giving rise to GCNNs, can be extended to other types of graph filters, leading to more general GNNs. First, we consider nonlinear graph filters. We observe that, while the use of pointwise activation functions is a sensible choice for data with a regular structure, where all neighborhoods look topologically the same, in more arbitrary, irregular structures there could be useful information to be learned from a nonlinear operation that takes this topology into account. Indeed, by using local nonlinear activation functions (as opposed to pointwise ones) we can significantly increase the descriptive power of the architecture. Moreover, these nonlinear activation functions can also be trained to adapt to the task at hand. Finally, we note that these local activation functions preserve the permutation equivariance property, effectively generalizing convolution to nonlinear domains.
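As an example of a localized, non-pointwise activation, a median activation function outputs at each node the median of the signal over its one-hop neighborhood; a minimal sketch (our implementation, included here for illustration):

```python
import numpy as np

def median_activation(S, x):
    """Localized (non-pointwise) activation: each node outputs the median
    of the graph signal over its one-hop neighborhood, itself included.
    Being built from neighborhoods only, it preserves permutation
    equivariance, unlike an arbitrary node-dependent nonlinearity."""
    out = np.empty(S.shape[0])
    for i in range(S.shape[0]):
        neigh = np.flatnonzero(S[i])              # one-hop neighbors of node i
        out[i] = np.median(np.append(x[neigh], x[i]))
    return out

# Example: 4-node path graph
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
out = median_activation(S, np.array([1.0, 5.0, 2.0, 9.0]))
```

Unlike a pointwise nonlinearity, the output at each node now depends on the local topology, which is the extra descriptive power the text refers to.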

Alternatively, we can replace the linear shift-invariant graph filters (which give rise to graph convolutions) by other, more general, linear operations. In fact, the most general linear operation that one can do, taking into account the underlying graph structure, is to filter with an edge-varying graph filter. Edge-varying graph filters assign a different filter tap to each edge in the network, allowing each node to learn how to best process the information incoming from each direction. Clearly, the number of learnable parameters in an edge-varying graph filter depends on the number of nodes and edges, and as such, it does not scale. As a result, we also propose different ways to regularize these filters while still retaining edge-level detail.
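A sketch of the edge-varying idea (one of several possible conventions; here z_k = Φ_k z_{k-1} with y = Σ_k z_k, and each Φ_k is constrained to the sparsity pattern of the graph, giving a separate learnable weight per edge per tap):

```python
import numpy as np

def edge_varying_filter(mask, Phis, x):
    """Edge-varying graph filter (illustrative sketch): each diffusion
    step k uses its own matrix Phi_k sharing the sparsity pattern of the
    graph, instead of a single scalar tap h_k:
        z_0 = x,  z_k = Phi_k z_{k-1},  y = sum_k z_k."""
    z = x.astype(float)
    y = z.copy()
    for Phi in Phis:
        # Every Phi must respect the graph sparsity (local exchanges only)
        assert np.all(Phi[~mask] == 0)
        z = Phi @ z          # exchange with neighbors, edge-specific weights
        y += z
    return y

# Example: 3-node path graph, two taps with random edge-specific weights
S = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
mask = S != 0
rng = np.random.default_rng(1)
Phis = [rng.random((3, 3)) * mask for _ in range(2)]
y = edge_varying_filter(mask, Phis, np.array([1.0, 2.0, 3.0]))
```

Note that if every Φ_k equals the shift operator S itself, the recursion collapses back to the graph convolution x + Sx + S²x, showing that the convolutional filter is the fully tied special case.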

E. Isufi, F. Gama, and A. Ribeiro, "EdgeNets: Edge Varying Graph Neural Networks," arXiv:2001.07620v3 [cs.LG], 27 July 2021, submitted to IEEE Trans. Pattern Anal. Mach. Intell.

L. Ruiz, F. Gama, A. G. Marques, and A. Ribeiro, "Invariance-Preserving Localized Activation Functions for Graph Neural Networks," IEEE Trans. Signal Process., vol. 68, pp. 127-141, 25 Nov. 2019.

F. Gama, B. G. Anderson, and S. Sojoudi, “Node-Variant Graph Filters in Graph Neural Networks,” arXiv:2106.00089v1 [cs.LG], 31 May 2021, submitted to 35th Conf. Neural Inform. Process. Syst.

L. Ruiz, F. Gama, A. Ribeiro, and E. Isufi, "Nonlinear State-Space Generalizations of Graph Convolutional Neural Networks," in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021, pp. 5265-5269.

E. Isufi, F. Gama, and A. Ribeiro, "Generalizing Graph Convolutional Neural Networks with Edge-Variant Recursions on Graphs," in 27th Eur. Signal Process. Conf. A Coruña, Spain: Eur. Assoc. Signal Process., 2-6 Sep. 2019, pp. 1-5.

L. Ruiz, F. Gama, A. G. Marques, and A. Ribeiro, "Median Activation Functions for Graph Neural Networks," in 44th IEEE Int. Conf. Acoust., Speech and Signal Process. Brighton, UK: IEEE, 12-17 May 2019, pp. 7440-7447.

F. Gama, G. Leus, A. G. Marques, and A. Ribeiro, "Convolutional Neural Networks via Node-Varying Graph Filters," in 2018 IEEE Data Sci. Workshop. Lausanne, Switzerland: IEEE, 4-6 June 2018, pp. 220-224.

Learning Decentralized Controllers

Dynamical systems comprised of autonomous agents arise in many relevant problems such as multi-agent robotics, smart grids, or smart cities. Controlling these systems is of paramount importance to guarantee a successful deployment. Optimal centralized controllers are readily available but face limitations in terms of scalability and practical implementation. Optimal decentralized controllers, on the other hand, are difficult to find. We can use graph neural networks (GNNs) to learn decentralized controllers from data. GNNs are well suited for the task, since they are naturally distributed architectures. Furthermore, they are equivariant and stable, leading to good scalability and transferability properties. The problem of flocking is explored to illustrate the power of GNNs in learning decentralized controllers. Most striking is the fact that the controllers can be trained on small networks and then successfully deployed on larger networks with the same performance. This shows the power of GNNs for transferring at scale.

Below, we show videos of sample trajectories. The trajectories are initialized by placing the agents at random on a disk. Each agent starts flying with a random velocity in an arbitrary direction. The agents have learned, by means of a GNN, a decentralized controller such that, based on their own state and that communicated by nearby neighbors, they adjust their acceleration in order to coordinate their velocity with that of the team, so that they all fly together while avoiding collisions. It is noted that the GNNs have been trained on trajectories different from the ones seen at test time. All of the videos below showcase a GNN-based controller that has been learned on teams of 50 agents and is then tested on larger teams without the need for retraining. It is easily observed that the performance transfers and scales.

F. Gama, Q. Li, E. Tolstaya, A. Prorok, and A. Ribeiro, "Decentralized Control with Graph Neural Networks," arXiv:2012.14906v2 [cs.LG], 8 June 2021, submitted to IEEE Trans. Signal Process.

Z. Gao, F. Gama, and A. Ribeiro, “Wide and Deep Graph Neural Network with Distributed Online Learning,” arXiv:2107.09203v1 [cs.LG], 19 July 2021, submitted to IEEE Trans. Signal Process.

Z. Gao, F. Gama, and A. Ribeiro, "Wide and Deep Graph Neural Networks with Distributed Online Learning," in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021, pp. 5270–5274.

F. Gama, E. Tolstaya, and A. Ribeiro, "Graph Neural Networks for Decentralized Controllers," in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021, pp. 5260-5264.

E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, "Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks," in Conf. Robot Learning 2019, vol. 100. Osaka, Japan: Proc. Mach. Learning Res., 30 Oct.-1 Nov. 2019, pp. 671-682.

The problem of flocking serves as a proof of concept to show the power of GNNs as a means of learning decentralized controllers. For a slightly more realistic setting, we can consider learning controllers based on raw visual observations. We present Vision-based Graph Aggregation and Inference (VGAI), a decentralized learning-to-control framework that directly maps raw visual observations to agent actions, aided by sparse local communication among only neighboring agents. Our framework is implemented by an innovative cascade of convolutional neural networks (CNNs) and one graph neural network (GNN), addressing agent-level visual perception and feature learning, as well as swarm-level local information aggregation and agent action inference, respectively. Using the application example of drone flocking, we show that VGAI yields performance comparable to or better than other decentralized controllers, and even the centralized controller that learns from global information. In particular, it shows substantial scalability, learning over large swarms (e.g., 50 agents), thanks to the integration of visual perception and local communication.

Fig. VGAI architecture. Each agent observes omnidirectional images, which are processed into a state vector. This state vector is then shared among agents by means of a GNN to decide on a suitable controller for successful flocking.

T.-K. Hu, F. Gama, T. Chen, W. Zheng, Z. Wang, A. Ribeiro, and B. M. Sadler, “Scalable Perception-Action-Communication Loops with Convolutional and Graph Neural Networks,” arXiv:2106.13358v1 [cs.RO], 24 June 2021, submitted to IEEE Trans. Signal, Inform. Process. Networks.

T.-K. Hu, F. Gama, T. Chen, Z. Wang, A. Ribeiro, and B. M. Sadler, "VGAI: End-to-End Learning of Vision-Based Decentralized Controllers for Robot Swarms," in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021, pp. 4900-4904.

Another example of learning decentralized controllers is that of multi-robot path planning. Efficient and collision-free navigation in multi-robot systems is fundamental to advancing mobility. Scenarios where the robots are restricted in observation and communication range call for decentralized solutions, whereby robots execute localized planning policies. From the point of view of an individual robot, however, its local decision-making system is incomplete, since other agents' unobservable states affect future values. The manner in which information is shared is crucial to the system's performance. To address these challenges, we propose a combined architecture, with the goal of learning a decentralized sequential action policy that yields efficient path plans for all robots. Our framework is composed of a convolutional neural network (CNN) that extracts adequate features from local observations, and a graph neural network (GNN) that communicates these features among robots. We train the model to imitate an expert algorithm, and use the resulting model online in decentralized planning involving only local communication. We evaluate our method in simulations involving teams of robots in cluttered workspaces. We measure the success rates and sum of costs over the planned paths. The results show a performance close to that of our expert algorithm, demonstrating the validity of our approach. In particular, we show our model's capability to generalize to previously unseen cases (involving larger environments and larger robot teams).

Q. Li, F. Gama, A. Ribeiro, and A. Prorok, "Graph Neural Networks for Decentralized Multi-Robot Path Planning," in 2020 IEEE/RSJ Int. Conf. Intell. Robots and Syst. Las Vegas, NV: IEEE, 25-29 Oct. 2020, pp. 11785-11792.

Q. Li, F. Gama, A. Ribeiro, and A. Prorok, "Graph Neural Networks for Decentralized Path Planning," in 19th Int. Conf. Autonomous Agents and Multi-Agent Syst. Auckland, New Zealand: IFAAMAS, 9-13 May 2020, pp. 1901-1903.

Controlling network systems has become a problem of paramount importance. Optimally controlling a network system with linear dynamics and minimizing a quadratic cost is a particular case of the well-studied linear-quadratic problem. When the specific topology of the network system is ignored, the optimal controller is readily available. However, this results in a centralized controller, facing limitations in terms of implementation and scalability. Finding the optimal distributed controller, on the other hand, is intractable in the general case. In this paper, we propose the use of graph neural networks (GNNs) to parametrize and design a distributed controller. GNNs exhibit many desirable properties, such as being naturally distributed and scalable. We cast the distributed linear-quadratic problem as a self-supervised learning problem, which is then used to train the GNN-based controllers. We also obtain sufficient conditions for the resulting closed-loop system to be input-state stable, and derive an upper bound on the trajectory deviation when the system is not accurately known. We run extensive simulations to study the performance of GNN-based distributed controllers and show that they are computationally efficient and scalable.
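As a sketch of the idea, one can parametrize a distributed controller by a graph filter and evaluate the resulting closed-loop quadratic cost; a crude grid search over the filter taps stands in here for the gradient-based self-supervised training used in the paper, and the dynamics matrix is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, T = 5, 2, 20                 # nodes, filter taps, control horizon

# Random undirected network and normalized shift operator S.
W = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
W += W.T
S = W / (np.abs(np.linalg.eigvals(W)).max() + 1e-9)

# Linear dynamics x_{t+1} = A x_t + u_t, with A tied to the network.
A = 0.9 * np.eye(N) + 0.1 * S
Sk = [np.linalg.matrix_power(S, k) for k in range(K)]

def rollout_cost(h, x0):
    """Closed-loop quadratic cost under the graph-filter controller
    u_t = sum_k h[k] S^k x_t (each node only needs k-hop information)."""
    x, cost = x0.copy(), 0.0
    for _ in range(T):
        u = sum(h[k] * Sk[k] @ x for k in range(K))
        cost += x @ x + u @ u
        x = A @ x + u
    return cost

# "Training": minimize the cost over the two filter taps.
x0 = rng.standard_normal(N)
best = min((rollout_cost(np.array([h0, h1]), x0), h0, h1)
           for h0 in np.linspace(-1.0, 0.0, 11)
           for h1 in np.linspace(-0.5, 0.5, 11))
print(best[0] < rollout_cost(np.zeros(K), x0))   # True: beats zero control
```

The key design choice is that the controller is naturally distributed: each node computes its input from its own state and that of its k-hop neighbors.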

F. Gama and S. Sojoudi, “Distributed Linear-Quadratic Control with Graph Neural Networks,” arXiv:2103.08417v3 [eess.SY], 13 July 2021, submitted to Signal Process.

F. Gama and S. Sojoudi, “Graph Neural Networks for Distributed Linear-Quadratic Control,” in 3rd Annu. Conf. Learning Dynamics Control. Zürich, Switzerland: Proc. Mach. Learning Res., 7-8 June 2021.

Other Applications

Fig. Location of weather stations in New York.

Consider the problem of predicting power outages. Power outages have a major impact on economic development due to the dependence of (virtually all) productive sectors on electric power. Thus, many resources within the scientific and engineering communities have been employed to improve the efficiency and reliability of power grids. In particular, we consider the problem of predicting power outages based on the current weather conditions. Weather measurements taken by a sensor network naturally fit within the graph signal processing framework since the measurements are related by the relative position of the sensors. We deploy graph neural networks to adequately process weather measurements in order to determine the likelihood of a power outage.
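A minimal sketch of such a predictor, with hypothetical filter coefficients in place of trained weights: a graph is built from the station coordinates, the weather measurements form a graph signal, and a one-layer GNN with a mean-pool readout scores the outage likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10                                  # weather stations

# Build the graph from station coordinates: a Gaussian kernel on the
# pairwise distances, encoding the relative sensor positions.
pos = rng.random((N, 2))
d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)
S = W / np.abs(np.linalg.eigvals(W)).max()

# Graph signal: one weather measurement per station (e.g. wind speed).
x = rng.standard_normal(N)

# One GNN layer with placeholder filter taps h (a trained model would
# learn these), then a mean-pool readout and a logistic unit that
# scores the likelihood of an outage.
h = np.array([0.5, 0.3, 0.2])
z = sum(h[k] * np.linalg.matrix_power(S, k) @ x for k in range(3))
z = np.maximum(z, 0.0)                  # ReLU
p_outage = 1.0 / (1.0 + np.exp(-z.mean()))
print(0.0 < p_outage < 1.0)             # True: a valid probability score
```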

D. Owerko, F. Gama, and A. Ribeiro, "Predicting Power Outages Using Graph Neural Networks," in 2018 IEEE Global Conf. Signal and Inform. Process. Anaheim, CA: IEEE, 26-29 Nov. 2018, pp. 743-747.

Another application related to power systems that we further investigated is the problem of Optimal Power Flow (OPF). This is one of the most important optimization problems in the energy industry. In its simplest form, OPF attempts to find the optimal power that the generators within the grid have to produce to satisfy a given demand. Optimality is measured with respect to the cost that each generator incurs in producing this power. The OPF problem is non-convex due to the sinusoidal nature of electrical generation and thus is difficult to solve. Using small angle approximations leads to a convex problem known as DC OPF, but this approximation is no longer valid when power grids are heavily loaded. Many approximate solutions have since been put forward, but these do not scale to large power networks. In this paper, we propose using graph neural networks (which are localized, scalable parametrizations of network data) trained under the imitation learning framework to approximate a given optimal solution. While the optimal solution is costly, it only needs to be computed for network states in the training set. At test time, the GNN adequately computes the OPF solution. Numerical experiments are run on the IEEE-30 and IEEE-118 test cases.
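The imitation-learning setup can be sketched as follows; here a hypothetical linear "expert" graph filter stands in for the costly OPF solver (which is only run on the training set), and the GNN is reduced to a single linear graph filter fitted by least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M = 6, 3, 200        # buses, filter taps, training samples

# Toy grid graph and normalized shift operator.
W = np.triu((rng.random((N, N)) < 0.5).astype(float), 1)
W += W.T
S = W / (np.abs(np.linalg.eigvals(W)).max() + 1e-9)
Sk = np.stack([np.linalg.matrix_power(S, k) for k in range(K)])

# Hypothetical "expert": an unknown graph filter mapping demands to
# optimal generation, standing in for the costly OPF solver.
h_true = np.array([1.0, -0.4, 0.15])
expert = lambda d: np.einsum('k,kij,j->i', h_true, Sk, d)

# Imitation learning: fit the filter taps by least squares on the
# (demand, expert output) pairs. A full GNN would add nonlinearities.
D = rng.standard_normal((M, N))
Y = np.stack([expert(d) for d in D])
Phi = np.stack([np.einsum('kij,j->ik', Sk, d) for d in D]).reshape(-1, K)
h_hat, *_ = np.linalg.lstsq(Phi, Y.ravel(), rcond=None)
print(np.allclose(Phi @ h_hat, Y.ravel()))   # True: imitates the expert
```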

D. Owerko, F. Gama, and A. Ribeiro, "Optimal Power Flow Using Graph Neural Networks," in 45th IEEE Int. Conf. Acoust., Speech and Signal Process. Barcelona, Spain: IEEE, 4-8 May 2020, pp. 5930-5934.

Graph Signal Processing

Controllability of Bandlimited Graph Processes over Random Time-Varying Graphs

Many relevant technological problems involving data stemming from social, financial, transportation or smart grid networks, among others, can be described in terms of a graph structure. In many of these cases, we would like to exert some control over the network to drive its behavior to a desired state. For example, in a transportation network, we would like to determine the timing of traffic lights so that we can re-route traffic to wider avenues instead of small streets. Or, in the case of a smart grid, we would like to determine how much power each generator should provide to the network for a better allocation of resources.

Evidently, these networks can suffer changes in their structure due to edge failures (closed streets in the transportation network, or failing power lines in the smart grid case) that can significantly alter the designed control strategy. We assume random edge failures that change over time, and we design a control strategy that takes these changes into account. More specifically, we define the concept of controllability in the mean, and derive conditions under which a network is controllable in finite time, when driving the state to a bandlimited graph signal, by acting only on a few selected nodes. Then, we carry out a mean squared error analysis to account for the randomness in the graph support and devise two strategies: a biased, minimum-MSE one, and an unbiased one.
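As a toy illustration of controllability in the mean (read here, in a simplified way, as classical controllability of the expected dynamics; the paper's conditions are more refined), consider a directed cycle whose edges survive independently with probability p. A single actuated node suffices:

```python
import numpy as np

N, p = 6, 0.8

# Nominal directed cycle; each edge survives independently with
# probability p, so the expected shift operator is p * W.
W = np.zeros((N, N))
for i in range(N):
    W[(i + 1) % N, i] = 1.0           # edge i -> i+1

S_mean = p * W

# Actuate a single seed node: x_{t+1} = S_mean x_t + b u_t.
b = np.eye(N)[:, [0]]

# Kalman rank test on the expected dynamics: the input diffuses around
# the cycle, reaching one new node per step, so the whole state space is
# reachable from a single node.
C = np.hstack([np.linalg.matrix_power(S_mean, t) @ b for t in range(N)])
print(np.linalg.matrix_rank(C))       # 6: controllable in the mean
```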

Fig. An example of a network, and two realizations of the random edge-failure of the network, when some of the edges disappear at random, over time.

F. Gama, E. Isufi, A. Ribeiro, and G. Leus, "Controllability of Bandlimited Graph Processes over Random Time-Varying Graphs," IEEE Trans. Signal Process., vol. 67, no. 24, pp. 6440-6454, 15 Dec. 2019.

M. Coutino, E. Isufi, F. Gama, A. Ribeiro, and G. Leus, "Design Strategies for Sparse Control of Random Time-Varying Networks," in 53rd Asilomar Conf. Signals, Systems and Comput. Pacific Grove, CA: IEEE, 3-6 Nov. 2019, pp. 184-188.

F. Gama, E. Isufi, G. Leus, and A. Ribeiro, "Control of Graph Signals over Random Time-Varying Graphs," in 43rd IEEE Int. Conf. Acoust., Speech and Signal Process. Calgary, AB: IEEE, 15-20 Apr. 2018, pp. 4169-4173.

Ergodicity in Stationary Graph Processes: A Weak Law of Large Numbers

Graph processes are sequences of graph signals, all defined over the same graph. One example is the temperature measured by the sensors in a network, where at each time instant we have a different measurement that can be conveniently represented by a graph signal. Another example is the diffusion of a rumor through a social network, where at each time instant we have a graph signal representing the people who have heard the rumor.

Of particular interest are graph processes that are stationary on the graph. Graph stationarity implies that the first- and second-order statistical moments of the process are related to the underlying graph structure. To be precise, the mean of the process has to be aligned with the slowest-varying signal on the graph, while the covariance follows the spectral structure of the graph. More importantly, knowledge of these two statistical moments (mean and covariance) completely characterizes stationary graph processes (i.e. everything there is to know about the process is condensed in the mean and the covariance). Therefore, knowing the mean and the covariance becomes of utmost importance when dealing with stationary graph processes.

Typical estimators of these moments rely on the existence of a training set with multiple realizations. However, when the process is stationary, only one realization of the process is enough, a concept known as ergodicity. In this work we prove that, by diffusing the single realization of the process through the graph, we obtain an estimator that converges to the true mean of the process, in a result reminiscent of the weak law of large numbers. Moreover, we generalize the diffusion estimator to graph filters, deriving an optimal filter that minimizes the MSE and is proven to converge on any graph.
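A numerical sketch of the diffusion estimator, with a lazy random walk on a cycle graph standing in for the graph filters analyzed in the paper: repeatedly diffusing a single noisy realization drives every node to the network-wide sample average, which in turn concentrates around the true mean:

```python
import numpy as np

rng = np.random.default_rng(5)
N, mu = 40, 3.0                      # sensors on a cycle, true process mean

# A single realization of the process: the true mean plus zero-mean noise
# (a simple stand-in for a stationary graph process).
x = mu + rng.standard_normal(N)

# Lazy random walk on the cycle (symmetric and doubly stochastic, so the
# network-wide average is preserved at every diffusion step).
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[i, (i - 1) % N] = 0.5
P = 0.5 * np.eye(N) + 0.5 * W

# Diffusion estimator: repeatedly diffuse the single realization through
# the graph. Every node converges to the sample average of x, which
# concentrates around mu (the weak-law-of-large-numbers flavor).
est = x.copy()
for _ in range(2000):
    est = P @ est                    # only local exchanges with neighbors

print(np.allclose(est, x.mean(), atol=1e-3))   # True: consensus on the average
print(abs(est.mean() - mu) < 1.0)              # True: close to the true mean
```

Note that each diffusion step is a purely local operation, so the estimator can be computed in a distributed fashion by the sensors themselves.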

Fig. Measurements taken by a sensor network of a single realization of the process.
Fig. The same realization has now been diffused across the sensors by repeated communications, building the diffusion estimator.
Fig. This is the actual true mean of the graph stationary process. We see that the output of the diffusion estimator resembles the true mean.

F. Gama and A. Ribeiro, "Ergodicity in Stationary Graph Processes: A Weak Law of Large Numbers," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2761-2774, 15 May 2019.

F. Gama and A. Ribeiro, "Distributed Estimation of Smooth Graph Power Spectral Density," in 2017 IEEE Global Conf. Signal and Inform. Process. Montreal, QC: IEEE, 14-16 Nov. 2017, pp. 643-647.

F. Gama and A. Ribeiro, "Weak Law of Large Numbers for Stationary Graph Processes," in 42nd IEEE Int. Conf. Acoust., Speech and Signal Process. New Orleans, LA: IEEE, 5-9 March 2017, pp. 4124-4128.

Rethinking Sketching as Sampling: A Graph Signal Processing Approach

Consider the problem of estimating the input to a given linear transform that generated a given measured output: y = Hx, where y is the measured output, H is the linear transform, and x is the input to be estimated. The approximate solution that minimizes the Euclidean norm of the error is the least squares (LS) solution, and it entails the calculation of a (pseudo-)inverse. This is a costly operation and, in the context of Big Data, in which the dimensions of the linear transform are very large, it becomes prohibitive. As a workaround, sketching techniques have emerged. The idea is to draw a sketch of the linear transform, i.e. to reduce its dimension while retaining its most important traits, so that the LS solution can be computed by means of the new reduced-dimension transform, yielding a good approximation to the LS solution. This is typically achieved by means of a random projection, i.e. the multiplication of both the measured output and the linear transform by a dimension-reducing random matrix. Sketching methods provide probabilistic guarantees on this approximate solution.
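A small numpy sketch of this sketch-and-solve procedure with a dense Gaussian projection (structured random projections are used in practice for speed); the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, m = 2000, 20, 200              # output dim, input dim, sketch dim

H = rng.standard_normal((n, d))      # the (tall) linear transform
x_true = rng.standard_normal(d)
y = H @ x_true + 0.01 * rng.standard_normal(n)

# Full least squares: solves the 2000 x 20 problem directly.
x_ls, *_ = np.linalg.lstsq(H, y, rcond=None)

# Sketch-and-solve: multiply BOTH the transform and the measurement by a
# random projection R, then solve the much smaller 200 x 20 problem.
R = rng.standard_normal((m, n)) / np.sqrt(m)
x_sk, *_ = np.linalg.lstsq(R @ H, R @ y, rcond=None)

rel_err = np.linalg.norm(x_sk - x_ls) / np.linalg.norm(x_ls)
print(rel_err < 0.1)                 # True: close to the full LS solution
```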

If the dimension of x is smaller than the dimension of y, then the equation y=Hx can be interpreted as y living in a lower dimensional subspace given by the columns of H. In fact, it is this lower dimensional structure that allows for sketching methods to work: there is a redundancy of information in using all values of y since they are actually determined by a combination of fewer elements. A subspace representation in the context of graph signal processing is equivalent to the graph signal being bandlimited; and it is well known that bandlimited (graph) signals can be sampled to obtain a sparser representation. Most sampling schemes of graph signals are designed so that the signal can be recovered from its samples and no further processing is taken into account. However, given that bandlimited signals have a subspace representation, then it is reasonable to exploit sampling schemes for sketching of linear transforms.

Fig. Direct problem. Design selection sampling matrix C and matrix sketch Hs to approximate output of applying linear transform.
Fig. Inverse Problem. Design selection sampling matrix C and sketch of estimator Hs that operates on output to estimate input.

Devising sampling schemes for sketching purposes turns out to be particularly useful in the context of streaming, wherein a succession of inputs needs to be processed as fast as possible. While sketching methods do process signals faster, they do not account for any particular structure of the signal to be processed, and hence multiply both the linear transform and each incoming signal vector by the same random projection matrix, every time. In the context of streaming, however, the linear transform remains the same while the signal vectors change in succession. Thus, by observing that the linear transform remains unchanged, the matrix sketch can be obtained offline and reused for each incoming signal vector. Additionally, there is no faster dimension-reducing operation for the signal vector than sampling. Therefore, by designing offline a sketch of the linear transform that operates only on a few samples of the signal vector, the whole processing can be carried out faster.

All in all, in the context of linearly processing bandlimited graph signals in a streaming fashion, the objective is to jointly design, offline, a sketch of the transform and a sampling scheme, so that an approximation to the desired result is obtained by operating only on a few samples of each signal. In the example below, each signal is the image of a digit, and these signals turn out to be bandlimited in the covariance graph (in other words, they admit a subspace representation in the PCA domain); the objective is to use a linear classifier on them. Then, by selecting only 20 of the 784 pixels in each image (shown in red in the figure), high classification accuracy can be achieved.
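The sampling-based scheme can be sketched as follows; the "graph-frequency" basis, the transform, and the sample selection below are placeholders (the paper optimizes the selection, while here any k rows happen to work):

```python
import numpy as np

rng = np.random.default_rng(7)
N, k = 64, 5                       # signal length, bandwidth

# Bandlimited model: signals live in the span of k "graph-frequency"
# vectors Vk (an arbitrary orthonormal basis stands in for the GFT here).
Vk = np.linalg.qr(rng.standard_normal((N, k)))[0]
H = rng.standard_normal((3, N))    # the linear transform to be applied

# Offline design: pick k sample locations (any k rows of Vk are almost
# surely invertible here; in practice the selection is optimized) and
# precompute the sketched transform Hs once.
samples = np.arange(k)
Hs = H @ Vk @ np.linalg.inv(Vk[samples])

# Online (streaming): for each incoming bandlimited signal, read only
# k of its N entries and apply the small precomputed sketch.
x = Vk @ rng.standard_normal(k)
print(np.allclose(Hs @ x[samples], H @ x))   # True: exact on the subspace
```

Since the expensive design is done once offline, the online cost per signal drops from applying a 3 x 64 transform to applying a 3 x 5 one.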

Fig. It suffices to select the pixels in red to be able to classify the digits in the images with high accuracy. (a, b, c) Pixels selected by sampling-for-reconstruction methods. (d, e, f, g, h) Pixels selected by different methods of solving the problem of optimizing over the best selection of pixels (matrix C) and the best sketching of the linear transform (matrix Hs).

F. Gama, A. G. Marques, G. Mateos, and A. Ribeiro, "Rethinking Sketching as Sampling: A Graph Signal Processing Approach," Signal Process., vol. 169, pp. 107 404 (1-15), Apr. 2020.

F. Gama, A. G. Marques, G. Mateos, and A. Ribeiro, "Rethinking Sketching as Sampling: Approximate Solution to Linear Inverse Problems," in 2016 IEEE Global Conf. Signal and Inform. Process. Washington, DC: IEEE, 7-9 Dec. 2016, pp. 390-394.

F. Gama, A. G. Marques, G. Mateos, and A. Ribeiro, "Rethinking Sketching as Sampling: Linear Transforms of Graph Signals," in 50th Asilomar Conf. Signals, Systems and Comput. Pacific Grove, CA: IEEE, 6-9 Nov. 2016, pp. 522-526.

Topological Data Analysis

Spherical Convolutional Neural Networks

Spherical convolutional neural networks (Spherical CNNs) learn nonlinear representations from 3D data by exploiting the data structure and have shown promising performance in shape analysis, object classification, and planning, among others. This paper investigates the properties that Spherical CNNs exhibit as they pertain to the rotational structure inherent in spherical signals. We build upon the rotation equivariance of spherical convolutions to show that Spherical CNNs are stable to general structure perturbations. In particular, we model arbitrary structure perturbations as diffeomorphism perturbations, and define the rotation distance that measures how far from rotations these perturbations are. We prove that the output change of a Spherical CNN induced by the diffeomorphism perturbation is bounded proportionally by the perturbation size under the rotation distance. This stability property, coupled with the rotation equivariance, provides theoretical guarantees that underpin the practical observations that Spherical CNNs exploit the rotational structure, maintain performance under structure perturbations that are close to rotations, and offer good generalization and faster learning.

Z. Gao, F. Gama, and A. Ribeiro, “Spherical Convolutional Neural Networks: Stability to Perturbations in SO(3),” arXiv:2010.05865v2 [cs.LG], 3 Apr. 2021, submitted to Signal Process.

Z. Gao, F. Gama, and A. Ribeiro, “Stability of Spherical Convolutional Neural Networks to Rotation Diffeomorphisms,” in 29th Eur. Signal Process. Conf. Dublin, Ireland: Eur. Assoc. Signal Process., 23-27 Aug. 2021.

Hierarchical Overlapping Clustering of Network Data Using Cut Metrics

Clustering is the task of grouping data points in such a way that points within the same group are more similar to each other than to points in other groups. Assuming that we have a set of data points that are related in some known way, the objective of clustering is then to partition the space of points into clusters that enclose those points that are similar.

The first difficulty we face with this definition is deciding, precisely, what we determine to be similar. That is, when are two points similar enough that they should be clustered together? The idea of similarity has a built-in notion of scale or resolution: depending on the scale we are considering, two points might or might not be similar. Thus, to overcome this difficulty, we should consider all possible levels of resolution and partition the space of points at each scale. Intuitively, these partitions should be nested, meaning that once two points are clustered together at a finer resolution (smaller scale), they should stay that way at a coarser resolution (larger scale). This is achieved by the use of hierarchical clustering methods. These methods output a hierarchy of clusters indexed by scale, from a finer resolution to a coarser resolution. This kind of nested hierarchy is called a dendrogram, and it is built using a measure of distance between points called an ultrametric.
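A compact numpy sketch of one such ultrametric, the single-linkage one: it can be computed as the minimax path cost between points, and cutting the resulting dendrogram at a resolution delta recovers the partition at that scale (the dissimilarity matrix below is made up for illustration):

```python
import numpy as np

# Made-up pairwise dissimilarities between five data points.
D = np.array([[0, 1, 4, 6, 6],
              [1, 0, 3, 6, 6],
              [4, 3, 0, 2, 5],
              [6, 6, 2, 0, 1],
              [6, 6, 5, 1, 0]], dtype=float)

# Single-linkage ultrametric: the minimax path cost between points,
# computed with a Floyd-Warshall-style update. u[i, j] is the smallest
# resolution at which i and j end up in the same cluster.
u = D.copy()
for k in range(len(D)):
    u = np.minimum(u, np.maximum(u[:, [k]], u[[k], :]))

# Cutting the dendrogram at resolution delta: two points are clustered
# together whenever their ultrametric distance is at most delta.
delta = 2.5
same = u <= delta
print(u[0, 4], same[0, 1], same[0, 3])   # 3.0 True False
```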

Fig. Dendrogram on the left. Example of partitions for two different scales.

The second problem resides in the decision to partition the space of points. A partition entails that each point belongs to one and only one group. However, in many cases some data points have traits of, or similarities to, more than one group, so we do not want to restrict ourselves to putting each point in only one group. Thus, we need to cluster into coverings instead of partitions, allowing points to belong to more than one group.

Fig. Chaining-effect problem when applying single-linkage clustering to a dumbbell network. While it is clear that the two clouds of points on the sides should be separate clusters, it is not clear how to group the points in the middle, since they are at equal distance from the points in the cloud and from the rest of the bar.

The main contribution of this work is a clustering method that is both hierarchical, accounting for all possible resolutions of similarity, and admits overlap, allowing data points to be clustered into more than one group. The basic building block of the resulting hierarchical overlapping clustering method is a new measure of distance between points called a cut metric. A cut metric measures distances between points and generates a nested collection of coverings, solving both the similarity and the partition problems. The paper also provides a systematic way of constructing cut metrics via linear combinations of ultrametrics, which are readily available from known hierarchical methods.
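A toy numpy sketch of the construction: averaging two hand-made ultrametrics yields a cut metric, and the delta-balls of the cut metric form a covering in which points may belong to several clusters (both ultrametrics are made up for illustration):

```python
import numpy as np

# Two ultrametrics on four points, as produced e.g. by two different
# hierarchical clustering methods (values made up for illustration).
u1 = np.array([[0, 1, 3, 3],
               [1, 0, 3, 3],
               [3, 3, 0, 1],
               [3, 3, 1, 0]], dtype=float)
u2 = np.array([[0, 3, 1, 3],
               [3, 0, 3, 3],
               [1, 3, 0, 3],
               [3, 3, 3, 0]], dtype=float)

# A cut metric as a convex combination of ultrametrics: still a metric,
# but no longer an ultrametric, so its clusters need not be nested.
c = 0.5 * u1 + 0.5 * u2

# Clusters at resolution delta: the delta-balls around each point. Unlike
# the blocks of a partition, these balls may overlap.
delta = 2.0
balls = [set(map(int, np.flatnonzero(c[i] <= delta))) for i in range(4)]
print(balls)   # point 0 belongs to three different balls (overlap)
```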

Fig. Example of classification of digits 1 and 7 and ambiguous digits.
Fig. Example of classification of plays by Shakespeare, by Fletcher and co-authored.

F. Gama, S. Segarra, and A. Ribeiro, "Hierarchical Overlapping Clustering of Network Data Using Cut Metrics," IEEE Trans. Signal, Inform. Process. Networks, vol. 4, no. 2, pp. 392-406, June 2018.

F. Gama, S. Segarra, and A. Ribeiro, "Overlapping Clustering of Network Data Using Cut Metrics," in 41st IEEE Int. Conf. Acoust., Speech and Signal Process. Shanghai, China: IEEE, 20-25 March 2016, pp. 6415-6419.