About AI4Science

By Jeremy Bernstein

August, 2018

Professor Yisong Yue discussing the multi-armed bandit problem. Originally inspired by slot machines, this problem is ubiquitous in engineering.

(Photo credits: Angie Liu)

Professor Anima Anandkumar discussing tensor methods, which are a set of tools that can efficiently solve various problems in machine learning.

Professor Maria Spiropulu discussing the volumes of data being generated in the hunt for physics beyond the standard model.

Discussions over lunch brought together scientists from across Caltech.

AI4science initiative at Caltech

Across science---from astrophysics to molecular biology to economics---a common problem persists: scientists are overwhelmed by the sheer amount of data they are collecting. But this problem might be better viewed as an opportunity, since with appropriate computing resources and algorithmic tools, scientists might hope to unlock insights from these swathes of data to carry their field forward. AI4science is a new initiative at Caltech aiming to bring together computer scientists with experts in other disciplines. While somewhat of a suitcase term, AI or artificial intelligence here means the combination of machine learning algorithms with large compute resources.

The initiative was launched with a workshop at Caltech. Professor Yisong Yue of Caltech’s Computing & Mathematical Sciences department (CMS) gave the first talk, a general overview of machine learning algorithms and their relevance across science and engineering. For example, he discussed how supervised learning may be used for problems ranging from detecting fake news, to classifying genetic sequences, to predicting the thermal stability properties of chemicals. He moved on to discuss his work on bandit algorithms. The bandit problem asks the following question: given that an agent must repeatedly play a set of n slot machines, each with an unknown payoff function, how should the agent pick which machines to play so as to learn to focus her efforts on the best machine? While certainly of interest to the gambling addict, the bandit problem turns out to be ubiquitous---in fact, later in the day, Professor Joel Burdick of the Division of Engineering and Applied Science discussed how he’d successfully used Professor Yue’s bandit algorithm in developing a neural prosthesis. The prosthesis was used to help paraplegic patients stand almost unaided, with the bandit algorithm helping to find the right neural stimulus to deliver to the patient’s spinal cord.
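
To make the bandit question concrete, here is a minimal sketch in Python. It uses UCB1, a standard textbook strategy, and is not the algorithm presented at the workshop. The function name `ucb1`, the three payoff probabilities, and the number of rounds are illustrative assumptions.

```python
import math
import random

def ucb1(payoff_probs, rounds, seed=0):
    """Play slot machines with the UCB1 rule and return how often each was played.

    payoff_probs are hypothetical win probabilities, hidden from the agent.
    """
    rng = random.Random(seed)
    n = len(payoff_probs)
    pulls = [0] * n      # number of times each machine has been played
    totals = [0.0] * n   # total reward collected from each machine
    for t in range(1, rounds + 1):
        if t <= n:
            arm = t - 1  # play every machine once to get a first estimate
        else:
            # Average reward so far plus a bonus that shrinks the more
            # often a machine has been played (this drives exploration).
            arm = max(
                range(n),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls

# Three machines with made-up win probabilities; the last one is the best.
pulls = ucb1([0.2, 0.5, 0.8], rounds=5000)
print(pulls)
```

Under these assumptions the agent ends up spending most of its plays on the third machine, even though it is never told which machine is best.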

Professor Andrew Stuart, also in CMS, gave the talk following Professor Yue. Stuart discussed his interest in fusing data science techniques with known physical law. As an example, he discussed how climate modelling is very difficult in spite of our knowledge of climate physics. This is in part due to modelling errors accumulating from chaotic and turbulent effects. But Stuart noted that a vast amount of climate data has been collected from satellite observations, and in work with Professor Tapio Schneider of Caltech’s Climate Science Group, he is attempting to combine this data with existing physics models to obtain more accurate predictions of the climate on Earth. Addressing this problem is immensely important for advising public policy regarding the rise in atmospheric CO2 concentrations.

Frederick Eberhardt, professor of philosophy, spoke next. He discussed his work on causal inference. A basic scientific principle is that “correlation does not imply causation”. But what does imply causation? Accurately identifying causes is extremely important. To drive the point home, Eberhardt recounted how a 1999 Nature study identified a correlation between myopia in children and the use of night lights. This might have suggested that parents should avoid using night lights to protect their children’s eyesight. But a subsequent study suggested that childhood myopia and night light use might actually have a common confounding cause: myopic parents. Clearly, correctly identifying causes is important for giving proper health advice to the public. Eberhardt stressed that intervention is the essence of causal discovery---to determine whether A has a causal effect on B, we should intervene by carefully changing A and checking for any effect on B. Eberhardt expressed his desire to build causal discovery algorithms that scale up to problems with hundreds of thousands of variables, so that they may be brought to bear on the large-scale problems that pervade science.
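
The difference between observing and intervening can be illustrated with a small simulation. The sketch below is only a toy model of the night light story: the function `simulate` and every probability in it are invented for illustration and are not taken from either study. In this toy world a child's myopia depends only on the parents, never on the night light.

```python
import random

def simulate(n, intervene, seed=0):
    """Return (myopia rate with night light, myopia rate without)."""
    rng = random.Random(seed)
    # night_light -> [number of children, number of myopic children]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        parent_myopic = rng.random() < 0.3  # assumed rate
        if intervene:
            # Intervention: the experimenter assigns night lights by coin flip.
            night_light = rng.random() < 0.5
        else:
            # Observation: myopic parents are assumed to use night lights more.
            night_light = rng.random() < (0.8 if parent_myopic else 0.2)
        # The child's myopia depends on the parents only (assumed rates).
        child_myopic = rng.random() < (0.6 if parent_myopic else 0.1)
        counts[night_light][0] += 1
        counts[night_light][1] += child_myopic
    return (counts[True][1] / counts[True][0],
            counts[False][1] / counts[False][0])

obs_with, obs_without = simulate(100_000, intervene=False)
exp_with, exp_without = simulate(100_000, intervene=True)
print("observed:  ", round(obs_with, 3), round(obs_without, 3))
print("intervened:", round(exp_with, 3), round(exp_without, 3))
```

In the observational run, children with night lights are myopic far more often, purely because of the shared cause. Once night lights are assigned at random, the two rates come out nearly equal, which is the signature of no causal effect.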

Professor Anima Anandkumar of the CMS department was the last computer scientist to speak. Anandkumar gave an overview of a successful machine learning technique known as artificial neural networks, which have dramatically improved the ability of computers to understand images and natural language. These neural networks are loosely inspired by the human brain---they have a large number of free parameters known as weights or synapses, and these weights are gradually fine-tuned to reduce the error on a large set of training examples. Remarkably, for many problems, the neural network is able to generalise from the training set to new, previously unseen images. Anandkumar also spoke about tensor methods in machine learning. Tensors are higher-order generalisations of matrices, and they appear naturally in a class of machine learning problems known as latent variable models. Learning in these latent variable models is reduced to the problem of decomposing a particular tensor extracted from the training data. The success of neural networks and tensor methods has shown that with enough data and compute resources, it can be easy to extract interesting patterns from data. It is this insight that Anandkumar is excited to bring to other areas of science.
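
The idea of gradually tuning weights to reduce training error can be shown at the smallest possible scale. The sketch below trains a single linear "neuron" with two parameters by plain gradient descent. It is a deliberately tiny stand-in for a real neural network, and the function `train`, the toy data, and the learning rate are all assumptions chosen for illustration.

```python
def train(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # the weights start out uninformed
    losses = []
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / len(errors))
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = 2 * sum(errors) / len(xs)
        # Nudge each weight a small step in the direction that lowers the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Toy training set generated from the rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b, losses = train(xs, ys)

# Prediction for an input that was not in the training set.
prediction = w * 10.0 + b
print(w, b, prediction)
```

The recorded training error falls over the course of the run, and the fitted weights give a sensible prediction for an input the model never saw, a very small analogue of generalising to unseen examples.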

The remainder of the day was devoted to talks from scientists who have had success applying machine learning techniques in their respective fields. Lior Pachter, professor of computational biology, discussed his work applying supervised learning to predict translation rates from sequences of RNA code. Pachter found that for this problem simple linear models worked well, and that deep networks were overkill. Maria Spiropulu, professor of high energy physics, discussed how petabytes of data are being generated every second at the Large Hadron Collider. Spiropulu has had success applying convolutional neural networks to recognise particles in the collected data. Kiri L. Wagstaff of the Machine Learning Autonomy Group at the Jet Propulsion Laboratory (JPL), which is managed by Caltech for NASA, discussed her work in the search for pulsars. Wagstaff has used machine learning in an interesting symbiosis with human researchers---the machine learning model whittles down a list of candidate pulsars, meaning less human time must be spent manually checking for false positives. Zach Ross, postdoc in Geophysics, discussed his work using convolutional neural networks to identify tiny seismic events. He hopes these methods can be used in an early warning system to identify foreshocks in the early onset of a large earthquake. Finally, Professor Joel Burdick talked about his work aiming to help spinal cord injury patients to walk. Burdick discussed how Professor Yue’s theory of duelling bandits helped the stimuli delivered to the spinal cord converge to optimal values without needing to define an explicit notion of success.

All speakers at the workshop expressed a sense of excitement about the potential for data-driven methods to tackle complex problems across scientific disciplines. More workshops are in the pipeline, and the AI4science initiative will progress with the computer scientists holding weekly office hours to enable scientists from other fields to ask questions about data-driven algorithms. With the availability of massive computational resources as well as the torrents of data pouring in from scientific labs, it will be interesting to see how the AI4science initiative develops at Caltech.