Data and Codes

IPC: A New Benchmark Graph Dataset [link]

In our recent papers "Online Planner Selection with Graph Neural Networks and Adaptive Scheduling" and "IPC: A Benchmark Data Set for Learning with Graph-Structured Data", we published a new data set for evaluating machine learning methods on graphs. It contains 2439 labeled graphs, presplit for training, validation, and testing. The node counts have a highly skewed distribution, ranging from fewer than ten to a few hundred thousand. This data set was initially used to study graph neural networks for AI planning; see Ma et al. (2018). It serves as a benchmark data set for graph representation learning and other machine learning tasks.


The graph convolutional networks (GCN) recently proposed by Kipf and Welling are an effective graph model for semi-supervised learning. This model, however, was originally designed to be trained in the presence of both training and test data. Moreover, the recursive neighborhood expansion across layers poses time and memory challenges for training with large, dense graphs. To relax the requirement of simultaneous availability of test data, we interpret graph convolutions as integral transforms of embedding functions under probability measures. Such an interpretation allows for the use of Monte Carlo approaches to consistently estimate the integrals, which in turn leads to the batched training scheme we propose in this work, FastGCN. Enhanced with importance sampling, FastGCN is not only efficient for training but also generalizes well for inference. We show a comprehensive set of experiments to demonstrate its effectiveness compared with GCN and related models. In particular, training is orders of magnitude more efficient while predictions remain comparably accurate.
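
The Monte Carlo idea described above can be illustrated with a small NumPy sketch. It is not the released FastGCN code: the function names, the random toy graph, and the sample sizes are assumptions made for this example, and the importance distribution (proportional to the squared column norms of the normalized adjacency matrix) is one choice that we assume here. The sketch estimates the aggregation step of a single graph convolution layer from a handful of sampled nodes and compares the average of many such estimates with the exact product.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the normalization commonly used with GCN."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def importance_distribution(a_hat):
    """Sampling probabilities proportional to squared column norms (assumed choice)."""
    col_sq = (a_hat ** 2).sum(axis=0)
    return col_sq / col_sq.sum()


def sampled_aggregate(a_hat, h, q, t, rng):
    """Estimate a_hat @ h from t nodes drawn i.i.d. from q.

    Each sampled column is reweighted by 1 / (t * q[u]), which makes the
    estimator unbiased for the exact product.
    """
    idx = rng.choice(a_hat.shape[0], size=t, replace=True, p=q)
    scaled = a_hat[:, idx] / (t * q[idx])
    return scaled @ h[idx]


rng = np.random.default_rng(0)
n, d = 50, 4

# Hypothetical toy graph: symmetric random adjacency without self-loops.
upper = np.triu((rng.random((n, n)) < 0.1).astype(float), k=1)
adj = upper + upper.T

a_hat = normalized_adjacency(adj)
h = rng.standard_normal((n, d))        # stand-in for node embeddings
q = importance_distribution(a_hat)

exact = a_hat @ h
estimates = np.stack(
    [sampled_aggregate(a_hat, h, q, t=10, rng=rng) for _ in range(2000)]
)
mean_estimate = estimates.mean(axis=0)
max_error = np.abs(mean_estimate - exact).max()
print("largest deviation of the averaged estimate:", max_error)
```

Each call touches only 10 of the 50 nodes, which is the source of the savings in a batched setting. A single estimate is noisy, and only the average over many draws approaches the exact aggregation. In a full model the estimate would be followed by the weight matrix and nonlinearity of the layer, with sampling repeated independently per layer.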


Drug similarity has been studied to support downstream clinical tasks such as inferring novel properties of drugs (e.g., side effects, indications, interactions) from known properties. The growing availability of new types of drug features brings the opportunity to learn a more comprehensive and accurate drug similarity that represents the full spectrum of underlying drug relations. However, it is challenging to integrate this heterogeneous, noisy, nonlinearly related information to learn accurate similarity measures, especially when labels are scarce. Moreover, there is a trade-off between accuracy and interpretability. In this paper, we propose to learn accurate and interpretable similarity measures from multiple types of drug features. In particular, we model the integration using multi-view graph auto-encoders, and add an attention mechanism that determines the weight of each view with respect to the corresponding tasks and features for better interpretability. Our model has a flexible design for both semi-supervised and unsupervised settings. Experimental results demonstrated significant improvement in predictive accuracy. Case studies also showed better model capacity (e.g., embedding node features) and interpretability.
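
The view-weighting idea in the paragraph above can be sketched as a forward pass in NumPy. This is a simplified illustration under stated assumptions, not the architecture from the paper: it uses one scalar attention score per view, a two-layer graph convolutional encoder, and an inner-product decoder, and every name, shape, and value below is hypothetical. Training (fitting the scores and weights to a reconstruction or task loss) is omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def row_normalize(sim):
    """Add self-loops and scale each row to sum to one."""
    a = sim + np.eye(sim.shape[0])
    return a / a.sum(axis=1, keepdims=True)


def fuse_views(views, scores):
    """Combine per-view similarity graphs with attention weights.

    `scores` holds one (hypothetical, normally learned) scalar per view;
    the softmax turns them into weights that can be read off directly.
    """
    alpha = softmax(scores)
    fused = sum(w * row_normalize(v) for w, v in zip(alpha, views))
    return fused, alpha


def encode(fused, x, w0, w1):
    """Two-layer graph convolutional encoder on the fused graph."""
    hidden = np.maximum(fused @ x @ w0, 0.0)   # ReLU
    return fused @ hidden @ w1


def decode(z):
    """Inner-product decoder: reconstructed pairwise similarity in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ z.T)))


rng = np.random.default_rng(0)
n_drugs, n_feats, n_hidden, n_latent = 6, 5, 8, 3

# Three hypothetical views, each a symmetric non-negative similarity matrix.
views = []
for _ in range(3):
    m = rng.random((n_drugs, n_drugs))
    views.append((m + m.T) / 2.0)

scores = np.array([0.0, 1.0, -1.0])            # assumed attention scores
x = rng.standard_normal((n_drugs, n_feats))    # stand-in node features
w0 = 0.1 * rng.standard_normal((n_feats, n_hidden))
w1 = 0.1 * rng.standard_normal((n_hidden, n_latent))

fused, alpha = fuse_views(views, scores)
z = encode(fused, x, w0, w1)
recon = decode(z)
print("view weights:", alpha)
```

Because the weights come from a softmax, they are positive and sum to one, so each weight can be read as the share of the fused graph contributed by that view. This is the sense in which such a mechanism can aid interpretability in this sketch.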


Product for Watson Education

I worked in the education research group for a while, where I joined the development of the Watson tutoring system. Our product has been featured in the Discovery documentary "This is AI". If you are interested in how AI is changing education, you may want to watch the video here and see the homepage of Watson Education.