Dr. Oliva explained how knowledge of human perception and cognition can be leveraged to develop deep neural network systems for visual recognition. She started by breaking down the tasks in visual recognition. The first step is Activity Recognition (Moments in Time), which aims to understand what an event is and how long it lasts. The point is that events typically last about three seconds (in the dataset they use), and if an event lasts longer, the first three seconds are enough to understand it. Features for activities are extracted using deep learning models, and attention heat maps are used to understand which regions of the image correspond to the activity. The next step is Multi-label Recognition, in which activities are allowed to have additional labels; with multiple labels, they see weak localization in the heat maps. These two steps are things we can already do with existing technology.

The next step is Abstraction: can we take raw videos and label them accurately at a higher level? Two tasks measure how well models can abstract: 1) set completion and 2) odd one out. Given a set of videos, the model should be able to tell us the abstraction they have in common. The abstraction can be produced by combining visual features as input with natural language supervision to generate a high-level representation of the similarities across a set of videos. The set completion task consists of selecting the query video that best fits with the reference videos (you are given reference and query sets).

Next, we can use visual analogies to understand machine perception. Visual analogies will be an upcoming area because there will be a lot for the model to do, and they can help us move toward common-sense-like reasoning. Additionally, visual memorability is a consequence of the optimizations required for visual processing: as humans, we do this automatically while learning. So how do we understand this in current models, or for computers?
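As a rough illustration of the two abstraction tasks, here is a minimal sketch that scores candidates by embedding similarity. It assumes each video has already been encoded into a fixed-length feature vector by some pretrained model; the mean-pooled "prototype", the function names, and the random toy embeddings are all illustrative assumptions rather than the speaker's actual method.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def set_completion(reference_embs, query_embs):
    """Pick the query video whose embedding best matches the shared
    abstraction of the reference set (here: their mean embedding)."""
    prototype = np.mean(reference_embs, axis=0)          # crude stand-in for the shared abstraction
    scores = [cosine(prototype, q) for q in query_embs]  # similarity of each candidate to the set
    return int(np.argmax(scores)), scores

def odd_one_out(video_embs):
    """Return the index of the video least similar to the rest of the set."""
    scores = []
    for i, v in enumerate(video_embs):
        rest = np.mean([e for j, e in enumerate(video_embs) if j != i], axis=0)
        scores.append(cosine(v, rest))                   # low score = poor fit with the others
    return int(np.argmin(scores)), scores

# Toy usage with random embeddings standing in for real video features.
rng = np.random.default_rng(0)
refs = rng.normal(size=(4, 512))
queries = rng.normal(size=(5, 512))
best, _ = set_completion(refs, queries)
```

In practice the shared abstraction would come from a model trained with natural language supervision, not a simple mean of visual features.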
Questions:
Do you do anything to understand the bias in your datasets? We see that current VQA datasets are often biased.
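As a hypothetical illustration of the kind of bias being asked about, the snippet below computes a question-only majority baseline on toy (question type, answer) pairs. The data are made up; on a real VQA dataset, a high score from such an image-blind baseline would indicate strong language priors.

```python
from collections import Counter, defaultdict

# Hypothetical (question_type, answer) pairs standing in for a real VQA annotation file.
annotations = [
    ("how many", "2"), ("how many", "2"), ("how many", "3"),
    ("what color", "white"), ("what color", "white"), ("what color", "red"),
    ("is there", "yes"), ("is there", "yes"), ("is there", "no"),
]

by_type = defaultdict(Counter)
for qtype, answer in annotations:
    by_type[qtype][answer] += 1

# Accuracy of always guessing the most frequent answer per question type,
# without ever looking at the image -- a rough measure of language-prior bias.
correct = sum(counts.most_common(1)[0][1] for counts in by_type.values())
print(f"question-only majority baseline: {correct / len(annotations):.2f}")
```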
Dr. Daniel Yurovsky talked about how toddlers learn words and what we can learn from how they learn. Toddlers learn by solving goals; they learn language by learning how to use it.
They also study how the relationship between words and context is complex. For example, "shoe" vs. "chair" is an easy contrast, but "shoe" is a bad contrast for distinguishing "sneaker" from "loafer"; still, once you learn "loafer," you can compare a banana with a loafer. Can you learn new concepts and continue to use them?
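A toy way to picture this contrast structure, using hand-made feature vectors as stand-ins for learned word embeddings (the vectors and feature dimensions are invented purely for illustration):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy features (footwear-ness, furniture-ness, food-ness, formality) --
# illustrative stand-ins for real word embeddings.
vecs = {
    "shoe":    np.array([1.0, 0.0, 0.0, 0.5]),
    "chair":   np.array([0.0, 1.0, 0.0, 0.5]),
    "sneaker": np.array([1.0, 0.0, 0.0, 0.2]),
    "loafer":  np.array([1.0, 0.0, 0.0, 0.8]),
    "banana":  np.array([0.0, 0.0, 1.0, 0.1]),
}

# A broad contrast is easy: shoe and chair are far apart.
print("shoe vs chair:     ", cos(vecs["shoe"], vecs["chair"]))
# A fine-grained contrast is hard: sneaker and loafer are nearly identical here.
print("sneaker vs loafer: ", cos(vecs["sneaker"], vecs["loafer"]))
# But once "loafer" has a representation, it can be compared with anything, e.g. banana.
print("loafer vs banana:  ", cos(vecs["loafer"], vecs["banana"]))
```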
Dr. Tenenbaum starts off by presenting examples of intuitive physics and intuitive psychology. Current AI is very far from this, and even from the common sense of other animals. What are we missing from the current ML paradigm? We should be able to generalize to an infinite range of new tasks and environments (vastly different from the training environments) with essentially no re-training or fine-tuning. What is the starting state (the inductive bias)? What are the learning procedures? To engineer common sense we need (a minimal sketch combining the first two ingredients follows after the probmods.org reference):
1. probabilistic programming languages (Symbolic languages)
2. bayesian inference
3. neural networks
Introduction: probmods.org
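probmods.org builds its models in WebPPL; as a rough Python analogue of the same idea (a generative probabilistic program plus Bayesian inference by conditioning on observations), here is a minimal rejection-sampling sketch. The coin-weight model is a toy assumption, not an example from the talk.

```python
import random

def model():
    """Generative model: an unknown coin weight, then 5 observed flips."""
    weight = random.random()                        # prior over the latent coin weight
    flips = [random.random() < weight for _ in range(5)]
    return weight, flips

def rejection_query(observed, n_samples=100_000):
    """Bayesian inference by rejection sampling: keep only runs of the
    generative program whose simulated data match the observation."""
    accepted = []
    for _ in range(n_samples):
        weight, flips = model()
        if flips == observed:
            accepted.append(weight)
    return sum(accepted) / len(accepted)            # posterior mean of the weight

# After seeing four heads and one tail, the posterior mean shifts above 0.5.
print(rejection_query([True, True, True, True, False]))
```

Rejection sampling is the simplest possible inference procedure; probabilistic programming systems swap in smarter inference (MCMC, variational, or neural amortized inference) behind the same model-plus-condition interface, which is where the third ingredient, neural networks, comes in.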
Dave Epstein explains how SoTA vision models can learn new words. He makes an analogy: kids can learn new words from their context, but machines cannot. What can we draw from toddlers? 1) visual and linguistic context to learn from, 2) a strategy for acquiring language from ambiguous tasks, 3) ???. Learning how to learn "stir" is more interesting than just learning what "stir" means; this is similar to meta-learning. The authors frame this as a meta-learning problem and use episodic learning.
(Can we use this to solve lexical gaps?)
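To make the episodic/meta-learning framing concrete, here is a generic sketch of one word-learning episode with a nearest-prototype prediction rule. The random embeddings, the nonce word "blicket", and the prototype classifier are illustrative assumptions, not the authors' actual architecture.

```python
import random
import numpy as np

def sample_episode(word_bank, n_way=3, k_shot=2):
    """Build one meta-learning episode: a small 'support' set of clip
    embeddings labelled with (possibly novel) words, plus one query clip.
    Embeddings are random stand-ins for features from a pretrained video encoder."""
    words = random.sample(list(word_bank), n_way)
    support = [(w, word_bank[w] + 0.1 * np.random.randn(64)) for w in words for _ in range(k_shot)]
    target = random.choice(words)
    query = word_bank[target] + 0.1 * np.random.randn(64)
    return support, query, target

def predict(support, query):
    """Nearest-prototype prediction: average the support embeddings per word,
    then pick the word whose prototype is closest to the query clip."""
    protos = {}
    for w, emb in support:
        protos.setdefault(w, []).append(emb)
    protos = {w: np.mean(e, axis=0) for w, e in protos.items()}
    return min(protos, key=lambda w: np.linalg.norm(protos[w] - query))

# Toy word bank: each (possibly novel) word gets a ground-truth "concept" embedding.
bank = {w: np.random.randn(64) for w in ["stir", "pour", "chop", "blicket"]}
support, query, target = sample_episode(bank)
print(predict(support, query), "vs true:", target)
```

Training on many such episodes is what "learning how to learn" a new word like "stir" amounts to in this framing.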
Common sense is not recorded in images as explicitly as it is in language
Dr. Smith takes an egocentric view to understand the experience of toddlers. Typically, individual objects dominate a toddler's point of view, and toddlers focus on one category at a time. In AI, however, we train our models on ALL types of coffee cups, yet we don't need to know what all coffee cups look like to learn what a coffee cup is. Toddlers know all aspects of a single object (e.g., a sippy cup), yet they still don't overfit to their data. Individual objects are the units of experience: a few objects appear at high frequency and lots of objects at low frequency, but babies can still learn. Object views are not stimulus points. The views babies generate are biased, and that biased data resulted in better training; for example, the longest axis of a view tends to correspond with the major axis of the 3D object.
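A quick simulation of the claimed frequency profile ("few objects at high frequency, lots of objects at low frequency"), using a Zipf-like rank-frequency curve with invented parameters rather than measured head-camera statistics:

```python
import numpy as np

# Simulate toddler experience dominated by a handful of individual objects:
# sample 10,000 "views" over 100 objects with Zipf-like probabilities.
rng = np.random.default_rng(0)
n_objects, n_views = 100, 10_000
probs = 1.0 / np.arange(1, n_objects + 1)        # rank-frequency ~ 1/rank
probs /= probs.sum()
views = rng.choice(n_objects, size=n_views, p=probs)

counts = np.bincount(views, minlength=n_objects)
top10_share = counts[np.argsort(counts)[-10:]].sum() / n_views
print(f"top 10 objects account for {top10_share:.0%} of all views")
```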
Book reference in response to a question about blind babies: Freaks of Nature