Title: Certified Machine Learning Safety
Abstract: Machine learning (ML) techniques, such as deep neural networks (DNNs), provide incredible opportunities to answer some of the most important and difficult questions in the computational sciences. Scientists and engineers are increasingly adopting ML to make potentially important decisions, such as design and optimization choices, in their applications of interest. Applications pertinent to the Department of Energy (DOE) mission, such as materials design and asteroid detection, are often safety-critical in nature, meaning that faulty decisions or predictions can have severe consequences. In safety-critical systems, it is important to answer questions such as: how can we be sure that the designed system will behave in a known, consistent, and correct manner? Unfortunately, a foundational challenge in current ML quality assurance research is that one cannot reliably validate whether a model has adequately learned the input domain, nor guarantee that the model will behave as expected under a broad range of unseen conditions. Yet, given the high-regret nature of mission-critical applications, devising mathematically rigorous validation and verification techniques is crucial for the sustained acceptance of data-driven ML solutions in DOE mission areas. In this talk, I will discuss some recent progress in this direction and highlight future directions.
Title: Marginalized Graph Kernels for Learning on Molecules: Theory, Software Stack, and Applications
Abstract: In this talk, I will demonstrate how kernel-based learning methods can be applied to graph-based datasets to build predictive models of molecular and atomistic properties that are accurate and interpretable using scarce training data. I will first introduce an active learning protocol using Gaussian process regression, which prompts the need for similarity measures between either entire molecules or individual atomistic neighborhoods. This leads to our recent work on a kernel that operates on graphical representations of molecules. The graph kernel is intuitive and allows the kernel trick to be applied to molecules of arbitrary size and topology. I will then introduce GraphDot, a Python package that implements the marginalized graph kernel, for molecules and beyond, on general-purpose GPUs. GraphDot delivers speedups of thousands of times over existing CPU-only packages. This is achieved by taking advantage of a generalized Kronecker product structure in the graph kernel, which leads to a streaming algorithm that can fully exploit the GPU's tremendous floating-point capability. A two-level sparse format and an adaptive primitive-switching mechanism ensure that the algorithm is equally efficient for both dense and sparse graphs. In the last part, I will showcase specific applications that use the graph kernel framework and the GraphDot package to construct Gaussian process regression models within minutes. With an active learning procedure, we can build GPR models that accurately predict atomization energies without explicit energy decomposition/localization. More examples involving the direct prediction of experimental observables will also be given.
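As a rough illustration of the idea behind the abstract above (not GraphDot's actual API), a marginalized random-walk graph kernel can be sketched in a few lines of NumPy: the kernel between two labeled graphs sums over simultaneous random walks on both graphs, which reduces to a single linear solve on their Kronecker-product graph. The delta vertex kernel and constant edge kernel below are simplifying assumptions made for this sketch.

```python
import numpy as np

def marginalized_graph_kernel(A1, labels1, A2, labels2, q=0.1):
    """Simplified marginalized (random-walk) graph kernel.

    A1, A2   : adjacency matrices (NumPy arrays)
    labels1/2: integer vertex labels
    q        : probability of stopping the walk at each step

    Assumes a delta vertex kernel (1 if labels match) and a constant
    edge kernel; a didactic sketch, not GraphDot's implementation.
    """
    n1, n2 = len(labels1), len(labels2)
    # Row-normalized random-walk transition matrices of each graph.
    T1 = A1 / A1.sum(axis=1, keepdims=True)
    T2 = A2 / A2.sum(axis=1, keepdims=True)
    # Vertex kernel evaluated on the product graph, flattened.
    kv = (np.asarray(labels1)[:, None] == np.asarray(labels2)[None, :]
          ).astype(float).ravel()
    # Transition operator on the product graph: a Kronecker product.
    W = np.kron(T1, T2)
    # Sum the infinite walk series with one linear solve:
    #   s = (I - (1 - q) * diag(kv) @ W)^(-1) @ kv
    M = np.eye(n1 * n2) - (1 - q) * (kv[:, None] * W)
    s = np.linalg.solve(M, kv)
    # Uniform start distribution over vertex pairs; stop with prob. q.
    return q * s.mean()
```

The exploitation of this same Kronecker structure, streamed rather than materialized, is what makes the GPU implementation fast: the dense `np.kron` here is quadratic in both graph sizes and is exactly what a production implementation avoids forming explicitly.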
Title: Few-Shot Learning for Biomedical Instance Segmentation
Abstract: Image segmentation is a laborious task representing a bottleneck in many biomedical applications. In particular, creating segmentation training data for convolutional neural networks (CNNs) is a time-intensive process requiring expert human annotation by clinicians or scientists. Therefore, there is a need for methods that reduce the annotation burden by (1) drastically decreasing the amount of data required to accurately train CNNs for segmentation, or (2) semi-automating the segmentation process while requiring minimal expert annotation. I will present an intuitive algorithm for hyper-accurate segmentation that is both robust and able to learn from only a few training images (few-shot learning). Our algorithm operates as a human does: by tracing the boundary of an object. Moreover, this type of iterative approach provides a natural platform for human-in-the-loop segmentation that is more accurate than CNNs alone and orders of magnitude faster than manual segmentation. I will also discuss opportunities for analyzing the new wealth of data these hyper-accurate segmentations provide, using cell microscopy images as an example.
Title: Towards developing computational methods for biological control
Abstract: In this talk, we refer to the achievement of an intended and predicted response in a biological system as controlling biology. Efforts towards controlling biology have evolved on different scales, from controlling gene expression at the single-cell level to controlling large complex networks such as glucose regulation via an artificial pancreas. However, most approaches are not immune to the inherent properties of nature, such as stochasticity and unmodeled dynamics. What allows nature to evolve and life to exist is what makes it challenging to control. New technology in synthetic biology and bioelectronics can give us unprecedented spatiotemporal control over nature. Through adaptive external “sense and respond” learning algorithms, we can gain improved control over cellular response. In this talk, I will discuss work towards developing NN-based predictors and feedback controllers that direct cellular response with no a priori model and no offline training. We also introduce efforts towards data-driven state-transition models of complex biological processes, in order to identify and drive systems towards desired reachable states via our ML-based controllers.
Title: Uncertainty quantification in chaotic systems using unsupervised machine learning
Abstract: Model parameter estimation using inference from data relies on a robust definition of a distance metric between data sets, typically chosen to be some vector norm computed using a logical ordering of the elements. For chaotic dynamical systems, however, residuals defined using point-wise distances between temporally collocated data points from numerical simulations and ground truth are no longer useful: small numerical perturbations result in disparate trajectories, even when the underlying parameters are identical. Rather than compare the data directly, we explore ideas for comparing the overall trajectory structure as a means to define distances. We make this effort tractable by operating in coordinates defined by the low-dimensional manifold on which the trajectory data lives, estimated using unsupervised machine learning approaches.
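A toy numerical experiment illustrates why point-wise residuals fail for chaotic systems. The sketch below integrates the Lorenz system and compares two distances: a temporally collocated point-wise residual, and a crude trajectory-level comparison using the normalized first and second moments of the point cloud (a simple stand-in for comparing coordinates on a learned low-dimensional manifold). All parameter values and the choice of moments are illustrative, not the method of the talk.

```python
import numpy as np

def lorenz_trajectory(x0, n_steps=20000, dt=0.002,
                      sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz system with classical 4th-order Runge-Kutta."""
    def f(x):
        return np.array([sigma * (x[1] - x[0]),
                         x[0] * (rho - x[2]) - x[1],
                         x[0] * x[1] - beta * x[2]])
    traj = np.empty((n_steps, 3))
    x = np.asarray(x0, dtype=float)
    for i in range(n_steps):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj[i] = x
    return traj

def pointwise_distance(a, b):
    """RMS of temporally collocated point-wise residuals."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

def structural_distance(a, b):
    """Crude attractor-level comparison: difference of normalized first
    moments plus difference of correlation structure of the two
    trajectory point clouds."""
    dmean = (a.mean(0) - b.mean(0)) / a.std(0)
    return np.linalg.norm(dmean) + np.linalg.norm(
        np.corrcoef(a.T) - np.corrcoef(b.T))
```

For two trajectories with identical parameters but initial conditions differing by 1e-6, the point-wise residual is large because the trajectories diverge, while the structural distance stays small because both trajectories sample the same attractor.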
Title: Advances in Protein Identification Via Spectral Library Search
Abstract: The key computational component in tandem mass spectrometry (MS/MS), also known as shotgun proteomics, is matching MS/MS spectra against theoretical spectra or actual spectra in spectral databases to identify possible peptides (protein sections). Recent advances in mass-spectrometry technology have led to a several-orders-of-magnitude increase in proteomics experiment throughput, which in turn has enabled the creation of massive spectral databases. The high computational complexity of the search process has led to approximate methods being primarily used to query large databases, and these methods fail to identify many of the spectra. Given the natural representation of mass spectra as sparse vectors in Euclidean space, efficient exact search methods can be devised that solve the search problem by ignoring the majority of spectra that are not close to the query. In my talk, I will first give some examples of work we are doing in my lab in the smart city and biomedical domains, and then present preliminary results from our work on improving protein identification through spectral library search.
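A minimal sketch of the pruning idea, assuming spectra have already been binned into sparse unit-norm intensity vectors (the binning scheme, class names, and data here are illustrative, not the talk's actual system). An inverted index over peak bins lets the search score only spectra that share at least one bin with the query; every skipped spectrum has cosine similarity exactly zero, so the search remains exact.

```python
import numpy as np
from collections import defaultdict

def normalize(spec):
    """spec: dict mapping m/z bin -> intensity; returns a unit-norm copy."""
    n = np.sqrt(sum(v * v for v in spec.values()))
    return {k: v / n for k, v in spec.items()}

class SpectralLibrary:
    """Exact nearest-neighbor search over sparse spectra (hypothetical sketch)."""

    def __init__(self, spectra):
        self.spectra = [normalize(s) for s in spectra]
        self.index = defaultdict(list)      # peak bin -> list of spectrum ids
        for i, s in enumerate(self.spectra):
            for b in s:
                self.index[b].append(i)

    def search(self, query):
        """Return (id, cosine score) of the best match, or (None, 0.0)."""
        q = normalize(query)
        # Accumulate dot products only for spectra sharing >= 1 peak bin;
        # all other spectra have similarity exactly 0, so skipping them
        # keeps the result identical to a brute-force scan.
        scores = defaultdict(float)
        for b, w in q.items():
            for i in self.index.get(b, []):
                scores[i] += w * self.spectra[i][b]
        if not scores:
            return None, 0.0
        best = max(scores, key=scores.get)
        return best, scores[best]
```

Production systems add further bounds (e.g., precursor-mass windows and partial-score thresholds) to prune even among bin-sharing candidates, but the skeleton above captures why sparsity makes exact search tractable.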
Title: Mapping Crop Types with Smartphone Crowdsourcing and Deep Learning
Abstract: Smallholder farms make up 84% of the world's 570 million farms, and are vital to achieving food security under growing populations in the developing nations of Asia and Africa. Despite their importance, there remains a scarcity of data on smallholder crop production, starting at the fundamental questions of which crop types smallholders grow and where they grow them. High resolution satellite imagery and modern machine learning methods hold the potential to fill these data gaps; however, high resolution crop type maps have remained challenging to create in developing regions due to a lack of ground truth labels for model development. This talk will show the use of crowdsourced data as a novel source of labels for crop type mapping in India. Plantix, a free app that uses image recognition to help farmers diagnose crop diseases, logged 10 million geolocated photos from 2017-2019 in India. The resulting dataset of crop type at geo-coordinates is high in volume, but also high in noise due to location inaccuracies and labeling errors. We will show how noise can be reduced to allow the labels to train a convolutional neural network (CNN) on satellite time series to map crop types in the smallholder systems of southeast India at 10m resolution.
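The talk does not specify its denoising method, but one plausible spatial-consensus filter for crowdsourced labels can be sketched as follows: keep a geolocated crop-type report only if most nearby reports agree with it. The function name, thresholds, and O(n^2) neighbor search below are all hypothetical choices made for illustration.

```python
import numpy as np
from collections import Counter

def consensus_filter(coords, labels, radius=0.001, min_agree=0.75, min_count=3):
    """Keep a crowdsourced label only if most nearby reports agree.

    coords : (n, 2) array-like of lon/lat points
    labels : list of n crop-type strings
    radius : neighborhood radius in degrees (illustrative value)

    Hypothetical denoising step; uses a brute-force O(n^2) neighbor
    search for clarity. Returns indices of the reports to keep.
    """
    coords = np.asarray(coords, dtype=float)
    keep = []
    for i in range(len(labels)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        nearby = [labels[j] for j in np.where(d < radius)[0]]
        if len(nearby) < min_count:
            continue  # too few reports to establish consensus
        top, cnt = Counter(nearby).most_common(1)[0]
        if top == labels[i] and cnt / len(nearby) >= min_agree:
            keep.append(i)
    return keep
```

Filters of this kind trade recall for precision: isolated reports are dropped, but the surviving labels are far more reliable as CNN training targets.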
Title: Tuning and control of particle accelerators and light sources using Bayesian optimization
Abstract: High-dimensional optimization is a critical challenge for operating large-scale scientific facilities that push the limits of technical and physical understanding. An example is the x-ray free-electron laser, where the extreme sensitivity to initial conditions requires a substantial fraction of operation time to be devoted to blind optimization. In this work we report on applying Bayesian optimization to transport optics tuning, controlling groups of quadrupole magnets to maximize X-ray laser pulse energy. We employ instance-based learning to adapt to new machine configurations and utilize prediction uncertainty to ensure robustness to noise. The machine response with respect to the control parameters is modeled using a Gaussian process probabilistic model. We show that Gaussian process models can both be trained on data and directly incorporate physical models, allowing them to beat the current state-of-the-art optimizers.
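The core loop described above can be sketched in one dimension: fit a Gaussian process to the observations so far, compute an acquisition function (expected improvement here) over candidate settings, and evaluate the machine at its maximizer. This is a minimal NumPy/SciPy sketch with illustrative hyperparameters, not the facility's actual tuning code, which must also handle noise models, safety constraints, and many coupled quadrupole settings.

```python
import numpy as np
from scipy.stats import norm

def rbf(X1, X2, length=0.2):
    """Squared-exponential kernel with unit prior variance."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and std at test points Xs, zero prior mean."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v * v, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """EI acquisition for maximization."""
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, n_iter=15):
    """Maximize f on [0, 1] with GP + expected improvement."""
    X = np.array([0.1, 0.5, 0.9])               # deterministic seed points
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 200)           # candidate settings
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    i = int(np.argmax(y))
    return X[i], y[i]
```

The "directly incorporate physical models" idea from the abstract would enter here through the kernel or the prior mean, e.g., replacing the zero prior mean with a beam-physics surrogate so the GP only has to learn the residual.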
Title: The Shattering Transform: formalizing convolutional networks to analyze raw sonar data from few examples
Abstract: There has been some progress in understanding the principles guiding convolutional neural nets (CNNs) using the tools of harmonic analysis, primarily through the scattering transform. This casts CNNs as alternations between well-understood linear systems, such as wavelets, a mixing non-linearity, and subsampling, and yields guarantees on the robustness of the transform to various signal transformations, such as translations or local deformations. Unlike CNNs, the scattering transform has no parameters to train, and acts as a preprocessing step before applying other classification methods. In this talk, I will introduce the shearlet scattering, or shattering, transform, which uses shearlets as the linear system in the scattering transform. Shearlets are guaranteed to represent mostly smooth signals with continuous edges almost as sparsely as possible. This property carries over to their use in the scattering transform, leading to a theoretically justified use of LASSO and other sparse linear methods. We use this transform to solve a classification problem on raw sonar data, distinguishing unexploded ordnance from rocks and other common objects; the shape of the sonar signals rules out the use of transfer learning, while the expense of generating data leads to (relatively) small sample sizes, ruling out training CNNs from scratch.
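The filter-modulus-average structure of the scattering transform can be illustrated in one dimension. The sketch below uses Gaussian band-pass bumps in the Fourier domain as a simple stand-in for wavelets or shearlets, takes the modulus of each filtered signal, and globally averages; the resulting coefficients are invariant to circular shifts, a toy version of the translation robustness guarantees mentioned above. Filter shapes and scales are illustrative, not the shearlet construction used in the talk.

```python
import numpy as np

def scattering_1d(x, n_scales=4):
    """First-order scattering coefficients of a 1-D signal.

    Band-pass filters are Gaussian bumps in the Fourier domain (a
    simplified stand-in for wavelets/shearlets); the modulus of each
    filtered signal is then globally averaged (the low-pass step).
    """
    n = len(x)
    freqs = np.fft.fftfreq(n)                  # cycles per sample
    X = np.fft.fft(x)
    coeffs = []
    for j in range(n_scales):
        center = 0.5 / 2 ** j                  # dyadic center frequencies
        width = center / 2
        psi = np.exp(-0.5 * ((np.abs(freqs) - center) / width) ** 2)
        u = np.abs(np.fft.ifft(X * psi))       # filter, then modulus
        coeffs.append(u.mean())                # global low-pass average
    return np.array(coeffs)
```

These fixed, non-negative coefficients would then feed a sparse linear classifier such as LASSO; a full scattering transform iterates the filter-modulus step to second order and uses a local rather than global average.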