Past talks

Title: Geometric Fields: Animation Beyond Meshes

Speaker: Ana Dodik

Time: Apr 16th, 12-1pm ET, 2024


Abstract: Despite a flurry of published papers on 3D data processing, modern tools such as Blender and Maya rely primarily on methods from over a decade ago. These tools rely on the finite-element method (FEM) to optimize various smoothness objectives. FEM-based algorithms---while providing strong guarantees necessary for disciplines like civil engineering---beget opaque and brittle pipelines in computer graphics. They put the burden on the end-user to ensure that 3D meshes are "well-behaved", requiring the user to, e.g., remove self-intersections or to ensure obscure mathematical properties such as manifoldness, water-tightness, or suitable interior angle bounds. My research uses techniques inspired by modern machine learning tools, not for data-driven learning, but as a computational domain for geometry processing that is agnostic to the shape representation and its quality while remaining aware of its geometry. These new mesh-free representations---geometric fields---allow our algorithms to focus on robustness, user control, and interactivity. This talk focuses on applications of geometric fields to problems in shape deformation and animation.
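
As a rough, hedged sketch of the mesh-free idea (not the speaker's actual formulation), a deformation "field" can be represented by a small coordinate MLP and optimized directly with automatic differentiation, with no mesh connectivity in the loop; the architecture, handle constraints, and smoothness proxy below are all illustrative assumptions.

```python
import torch

# A coordinate MLP stands in for a "field": it maps any 3D query point to a
# displacement, with no mesh connectivity required.
field = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 3),
)
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

# Hypothetical user constraints: two handle points and where they should move.
handles = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
targets = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0]])

for step in range(2000):
    x = torch.rand(512, 3, requires_grad=True)      # sample points anywhere in space
    d = field(x)
    # Cheap smoothness proxy: penalize spatial gradients of the summed
    # displacement components (autograd replaces the FEM machinery here).
    grad = torch.autograd.grad(d.sum(), x, create_graph=True)[0]
    smoothness = (grad ** 2).mean()
    fit = ((handles + field(handles) - targets) ** 2).mean()
    loss = fit + 0.1 * smoothness
    opt.zero_grad()
    loss.backward()
    opt.step()
```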


Speaker bio: Ana Dodik is a PhD student at MIT CSAIL working on neural representations for geometry processing. Prior to joining MIT, she spent two years developing next-generation virtual presence at Meta. She graduated with a Master’s degree from ETH Zurich, where she spent a year collaborating with Disney Research Studios on problems at the intersection of machine learning and offline rendering.

Title: Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering

Speaker: Vivek Gopalakrishnan

Time: Apr 2nd, 12-1pm ET, 2024


Abstract: We investigate the camera pose estimation problem in the context of 2D/3D medical image registration. In our application, we seek to align 2D intraoperative images (e.g., X-rays) to 3D preoperative volumes (e.g., CT) from the same patient, helping provide 3D image guidance during minimally invasive surgeries. We present DiffPose, a patient-specific self-supervised approach that uses differentiable X-ray rendering to achieve the sub-millimeter registration accuracy required in this setting. Several aspects of our work may also be of interest to the broader computer vision community.

This is joint work with Neel Dey and Polina Golland (https://arxiv.org/abs/2312.06358).
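
For readers unfamiliar with registration by differentiable rendering, here is a minimal toy sketch: a CT volume is resampled under a candidate pose, integrated into a simulated X-ray, and the pose is updated by gradient descent on an image loss. The parallel-beam renderer, affine pose parameterization, and MSE loss are simplifying assumptions, not the DiffPose implementation.

```python
import torch
import torch.nn.functional as F

def render_drr(volume, theta):
    """Toy parallel-beam 'X-ray': warp the CT volume by an affine pose and
    integrate along one axis. Differentiable w.r.t. theta (a 3x4 matrix)."""
    grid = F.affine_grid(theta[None], volume[None, None].shape, align_corners=False)
    warped = F.grid_sample(volume[None, None], grid, align_corners=False)
    return warped.sum(dim=2)[0, 0]               # integrate over depth -> 2D image

def register(volume, xray, iters=200, lr=1e-2):
    """Estimate the pose that makes the rendered X-ray match the observed one."""
    theta = torch.eye(3, 4, requires_grad=True)  # start from the identity pose
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(iters):
        loss = ((render_drr(volume, theta) - xray) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()

# Usage: volume is a (D, H, W) CT tensor and xray an (H, W) target image.
# A real system would use a rigid 6-DoF parameterization and a stronger
# image-similarity measure than MSE.
```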


Speaker bio: Vivek is a third-year PhD student in the Harvard-MIT Health Sciences and Technology program. He is advised by Polina Golland and works on 3D computer vision problems across science and medicine.

Title: Neural Lithography: Close the Design to Manufacturing Gap in Computational Optics with a 'Real2Sim' Learned Photolithography Simulator

Speaker: Cheng Zheng, MIT MechE

Time: Mar 19th, 12-1pm ET, 2024


Abstract: Computational optics with large design degrees of freedom enable advanced functionalities and performance beyond traditional optics. However, the existing design approaches often overlook the numerical modeling of the manufacturing process, which can result in significant performance deviation between the design and the fabricated optics. In this talk, I will introduce neural lithography, including a real2sim pipeline to quantitatively construct a high-fidelity neural photolithography simulator and a design-fabrication co-optimization framework to bridge the design-to-manufacturing gap in computational optics.
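
As a hedged illustration of what "co-optimization through a learned simulator" can look like (not the paper's actual models), the sketch below freezes a stand-in neural lithography simulator and optimizes a design end to end through it and a toy far-field optics model; every component here is an assumption.

```python
import torch

# Stand-in for the learned (real2sim) photolithography simulator: maps a design
# to a predicted fabricated structure. In practice it would be pretrained on
# measured fabrication data and then frozen.
litho_sim = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 1, 3, padding=1), torch.nn.Sigmoid(),
)
for p in litho_sim.parameters():
    p.requires_grad_(False)

design = torch.zeros(1, 1, 64, 64, requires_grad=True)       # mask / height-map logits
target = torch.zeros(64, 64)
target[28:36, 28:36] = 1.0                                    # desired far-field pattern
opt = torch.optim.Adam([design], lr=1e-2)

for step in range(500):
    printed = litho_sim(torch.sigmoid(design))                # predicted fabricated optic
    field = torch.fft.fftshift(torch.fft.fft2(printed[0, 0])) # toy scalar diffraction model
    intensity = field.abs() ** 2
    loss = ((intensity / intensity.sum() - target / target.sum()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```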


Speaker bio: Cheng Zheng is a Ph.D. student at MIT, mentored by Prof. Peter So, with her research centered on computational imaging and optics. She is pioneering developments in computational imaging and computational lithography, aiming to create advanced imaging and optical systems that intelligently and informatively interact with the physical world.

Title: Generative Models for Creative Applications

Speaker: Yael Vinker, Tel Aviv University

Time: Mar 12th, 12-1pm ET, 2024


Abstract: The initial stage of a design process is often highly explorative and unexpected, involving activities such as brainstorming, seeking inspiration, sketching, and planning. These activities require prior knowledge, creativity, and design skills. Can computers participate in such a highly creative process, assisting humans in developing and exploring design ideas? In this talk, I will share some of my recent research, which explores this question from various perspectives. I will demonstrate how my work leverages recent advancements in large vision-language models to drive advancements in this field. Specifically, I will discuss the automatic generation of visual abstractions and sketches from images (CLIPasso and CLIPascene). I will also demonstrate how such models can be used in typography to create semantic word-as-image illustrations, and in design, to decompose a visual concept, represented by a collection of images, into distinct visual elements encoded into a hierarchical tree, facilitating exploration and inspiration. Finally, I will share my perspective on how human-machine collaborative interaction can be enhanced through generative models.


Speaker bio: Yael Vinker is a PhD student at Tel Aviv University advised by Prof. Daniel Cohen-Or and Prof. Ariel Shamir. Her research lies at the intersection of computer vision and art, with a focus on generative models and their capacity to produce non-photorealistic content. Yael has received two Best Paper Awards for her works "CLIPasso" (SIGGRAPH 2022) and "Inspiration Tree" (SIGGRAPH Asia 2023) and an Honorable Mention Best Paper Award for her work "Word-as-Image" (SIGGRAPH 2023). She is a recipient of the Council for Higher Education (VATAT) scholarship for outstanding female PhD candidates. Yael completed her BSc (combined with visual communication) and MSc at The Hebrew University of Jerusalem, and interned at Google Research, EPFL, and Disney Research.

Title: Co-Optimizing Human-System Performance in VR/AR

Speaker: Qi Sun, NYU

Time: Nov 22nd, 12-1pm ET, 2022


Abstract: Virtual and Augmented Reality enable unprecedented possibilities for displaying virtual content, sensing physical surroundings, and tracking human behaviors with high fidelity. However, we still haven't created "superhumans" who can outperform what we are capable of in physical reality, nor a "perfect" XR system that delivers infinite battery life or realistic sensation. In this talk, I will discuss some of our recent research on leveraging eye/muscular sensing and learning to model our perception, reaction, and sensation in virtual environments. Based on this knowledge, we create just-in-time visual content that jointly optimizes human performance (such as reaction speed to events) and system performance (such as reduced display power consumption) in XR.


Speaker bio: Qi Sun is an assistant professor at New York University. Before joining NYU, he was a research scientist at Adobe Research. He received his PhD at Stony Brook University. His research interests lie in perceptual computer graphics, VR/AR, computational cognition, and visual optics. He is a recipient of the IEEE Virtual Reality Best Dissertation Award, with his research recognized as Best Paper and Honorable Mention awards in ACM SIGGRAPH. His research is funded by NASA, NSF, DARPA, NVIDIA, and Adobe.

Title: Applications of Human Perception to Photography and Readability

Speaker: Zoya Bylinskii, Adobe Research

Time: Nov 15th, 12-1pm ET, 2022


Abstract: How can we leverage our understanding of human perception and cognition to build tools to more effectively communicate information to broad audiences? The first half of this talk will be about how visual attention modeling can be applied to image processing applications, including the problems of detecting and suppressing distractors in photographs, in order to redirect viewer attention to what matters most in an image. The second half of this talk will apply perception insights to the redesign and personalization of text formats for improved readability and accessibility. Both examples of interdisciplinary work will showcase how we can combine insights from psychophysics and professional design workflows towards the development of practical, user-facing applications.


Speaker bio: Zoya is a Research Scientist in the Creative Intelligence Lab at Adobe Research (based in Cambridge, MA). Zoya received her Ph.D. in Computer Science from MIT in 2018, advised by Fredo Durand and Aude Oliva, and an Hon. B.Sc. in Computer Science and Statistics from the University of Toronto in 2012. Zoya works at the interface of human perception & cognition, computer vision, and human-computer interaction applied to graphic designs, visualizations, and readability.

Title: Towards Computational Soft Robotics For The Real World

Speaker: Andrew Spielberg, Harvard

Time: Nov 8th, 12-1pm ET, 2022


Abstract: Soft robotics has historically been dominated by iteration-heavy design cycles, relying on a process that has required a combination of physical intuition and trial-and-error sandwiched by long and costly fabrication times.  Methods for quickly designing soft robots in silico with computationally verifiable performance have the potential to save time and resources and vastly improve the types of robots being built.  In this talk I will describe our recent progress in computationally modeling, controlling, designing, and fabricating real soft robots.  I will discuss how emerging techniques in differentiable simulation and machine learning can automate the design cycles of physically fabricable robots, demonstrate how soft robots can be combined with learning-based control to solve difficult real-world tasks, and propose pipelines for digitally fabricating and modeling custom soft robots.  Finally, I will describe my vision for an end-to-end computational soft robotics stack, and future opportunities for the field.


Speaker bio: Andrew Spielberg is a Postdoctoral Fellow at Harvard University in the Lewis Lab; he is also a member of The Movement Lab at Stanford University.  His research focuses on developing algorithms to co-design novel types of rigid and soft robots in form and behavior, and automatically fabricate them.  His work has touched upon topics in soft matter and differentiable simulation, numerical optimization and machine learning for robot control and design, and digital fabrication processes such as 3D printing and textile-manufacturing.  He received his B.S. and M.Eng from Cornell University and his PhD from MIT.  His work won a best paper award at CHI, and has been nominated for best paper awards at ICRA and Robosoft.

Title: Reliable and Accessible Visual Recognition

Speaker: Judy Hoffman

Time: May 3rd, 12pm ET, 2022


Abstract: As visual recognition models are developed across diverse applications, we need the ability to reliably deploy our systems in a variety of environments. At the same time, visual models tend to be trained and evaluated on a static set of curated and annotated data which only represents a subset of the world. In this talk, I will cover techniques for transferring information between different visual environments and across different semantic tasks, thereby enabling recognition models to generalize to previously unseen worlds, such as from simulated to real-world driving imagery. I'll highlight a new benchmark and method for selecting a source model for accessible transfer to a new visual task.


Speaker bio: Dr. Judy Hoffman is an Assistant Professor in the School of Interactive Computing at Georgia Tech and a member of the Machine Learning Center. Her research lies at the intersection of computer vision and machine learning with specialization in domain adaptation, transfer learning, adversarial robustness, and algorithmic fairness. She is the recipient of the Samsung AI Researcher of the Year award, Google Scholar Faculty Award, NVIDIA female leader in computer vision award, AIMiner top 100 most influential scholars in Machine Learning (2020), and MIT EECS Rising Star in 2015. In addition to her research, she co-founded and continues to advise for Women in Computer Vision, an organization which provides mentorship and travel support for early-career women in the computer vision community. Prior to joining Georgia Tech, she was a Research Scientist at Facebook AI Research. She received her PhD in Electrical Engineering and Computer Science from UC Berkeley in 2016 after which she completed Postdocs at Stanford University (2017) and UC Berkeley (2018).


Title: Robust Neural Field Models, and Humans Interacting in the 3D World

Speaker: Gerard Pons-Moll

Time: Feb 8th, 12pm ET, 2022


Abstract: The field of 3D shape representation learning and reconstruction has been revolutionised by combinations of neural networks with implicit surfaces and field representations. However, most models are not robust to rotations and translations of the input shape, lack detail, need to be trained with clean watertight shapes, and cannot be controlled in terms of pose and shape. I will describe a class of neural implicit models (IF-Nets, NDF) which are designed to be robust to input shape transformations, can be trained from raw scans, and can output non-closed surfaces such as shapes with inner structures and 3D scenes. I will also describe our recent works to make NeRF generalise to novel scenes (SRF) and dynamic scenes (D-NeRF). I will also show how to broaden the representation power of neural implicit surfaces by exploiting generalized implicit surfaces (Neural-GIF), and I will demonstrate how to use Neural-GIF to output detailed humans in clothing, with control over pose and shape.

Beyond scene and shape generation and completion tasks, neural implicits can play an important role in building mental models of the 3D world, which are crucial for building autonomous humans interacting in the 3D world. Towards that ambitious goal, if time allows, I will conclude with our recent work, the Human Poseitioning System (HPS), to capture and model natural human interactions in large scenes.


Speaker bio: Gerard Pons-Moll is a Full Professor at the University of Tübingen, in the department of Computer Science. He is also the head of the Emmy Noether independent research group "Real Virtual Humans", senior researcher at the Max Planck Institute for Informatics (MPII) in Saarbrücken, Germany, and faculty at the IMPRS-IS (International Max Planck Research School - Intelligent Systems in Tübingen). His research lies at the intersection of computer vision, computer graphics and machine learning -- with special focus on 3D vision and analyzing people in videos, and creating virtual human models by "looking" at real ones. His research has produced some of the most advanced statistical human body models of pose, shape, soft-tissue and clothing (which are currently used for a number of applications in industry and research), as well as algorithms to track and reconstruct 3D people models from images, video, depth, and IMUs.



Title: Exploring Invertibility in Image Processing and Restoration

Speaker: Qifeng Chen

Time: Jan 25th, 12pm ET, 2022


Abstract: Today's smartphones have enabled numerous stunning visual effects from denoising to beautification, and we can share high-quality JPEG images easily on the internet, but it is still valuable for photographers and researchers to keep the original raw camera data for further post-processing (e.g., retouching) and analysis. However, the huge size of raw data hinders its popularity in practice, so can we almost perfectly restore the raw data from a compressed RGB image and thus avoid storing any raw data? This question leads us to design an invertible image signal processing pipeline. We then further explore invertibility in other image processing and restoration tasks, including image compression, reversible image conversion (e.g., image-to-video conversion), and embedding novel views in a single JPEG image. In the end, we demonstrate a general framework for restorable image processing operators with quasi-invertible networks.
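
To make the notion of an invertible image operator concrete, here is a generic affine-coupling block (the RealNVP-style construction); it only illustrates how exact invertibility can be built into a network and is not the speaker's invertible ISP or compression architecture.

```python
import torch

class AffineCoupling(torch.nn.Module):
    """Invertible building block: half the channels pass through unchanged and
    parameterize an affine transform of the other half."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(half, 64, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(64, 2 * half, 3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(torch.tanh(log_s)) + t      # invertible affine transform
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(log_s))   # exact recovery of the input
        return torch.cat([y1, x2], dim=1)

# Round-trip check: inverse(forward(x)) reproduces x up to float precision.
x = torch.randn(1, 4, 32, 32)
layer = AffineCoupling(4)
print(torch.allclose(layer.inverse(layer(x)), x, atol=1e-5))
```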


Speaker bio: Qifeng Chen is an assistant professor at The Hong Kong University of Science and Technology. He received his Ph.D. in computer science from Stanford University in 2017. His research interests include image processing and synthesis, 3D vision, and autonomous driving. He was named one of 35 Innovators under 35 in China by MIT Technology Review and received the Google Faculty Research Award in 2018. He won 2nd place worldwide at the ACM-ICPC World Finals and a gold medal in IOI. He co-founded the startup Lino in 2017.



Title: Generative Modeling by Estimating Gradients of the Data Distribution

Speaker: Yang Song

Time: Dec 14th, 12pm ET, 2021

 

Abstract: Generative models typically fit a probability distribution to a dataset and sample from it to synthesize new data, such as images and audio. However, parameterizing a probability distribution is non-trivial, due to the constraint that all properly defined probability distributions must be normalized. This requires restricted model architectures or undesirable approximations. To address these challenges, we propose an alternative approach based on estimating the vector field of gradients of the data distribution (known as the score function), which does not require normalization. The resulting generative models, called score-based generative models, sidestep challenges of previous approaches and obtain uniformly better performance on sample generation, challenging the long-time dominance of GANs on tasks including images, audio, molecules and material structures. I will further show how they are useful for solving ill-posed inverse problems, with strong performance in applications ranging from image editing to medical image reconstruction.
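
As a minimal sketch of how a learned score function is used at sampling time, the snippet below runs plain Langevin dynamics; real score-based models additionally anneal over a sequence of noise levels and use tuned step sizes, which is omitted here.

```python
import math
import torch

def langevin_sample(score_fn, shape, n_steps=1000, step=1e-2):
    """Draw samples by following an (approximate) score function grad_x log p(x)."""
    x = torch.randn(shape)                     # initialize the chain from noise
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        # Langevin update: move along the score plus Gaussian exploration noise.
        x = x + 0.5 * step * score_fn(x) + math.sqrt(step) * noise
    return x

# Toy check: for a standard Gaussian the score is -x, so the chain should yield
# samples with mean near 0 and variance near 1.
samples = langevin_sample(lambda x: -x, (5000, 2))
print(samples.mean().item(), samples.var().item())
```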


Bio: Yang Song is a sixth-year Ph.D. student advised by Prof. Stefano Ermon. His research interest is in deep generative models and their applications to inverse problem solving and AI safety. His first-author papers have been recognized with an Outstanding Paper Award at ICLR and an oral presentation at NeurIPS 2019. He is a recipient of the Apple PhD Fellowship in AI/ML, the J.P. Morgan PhD Fellowship, and the WAIC Rising Star Award.



Title: Multimodal Conditional Image Synthesis with Product-of-Experts GANs

Speaker: Ming-Yu Liu

Time: Nov 30th, 12pm ET, 2021


Abstract: Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, sketch, or style reference. They are often unable to leverage multimodal user inputs when available, which reduces their practicality. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting.
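
The "product of experts" in the generator can be illustrated with the textbook Gaussian case, where any subset of modality experts combines in closed form; the shapes and the prior expert below are illustrative, not the PoE-GAN code.

```python
import torch

def product_of_gaussians(means, logvars):
    """Combine Gaussian experts (each [batch, dim]) into a single Gaussian.

    The product of Gaussians is Gaussian: precisions add, and the mean is the
    precision-weighted average of the experts' means."""
    precisions = torch.exp(-torch.stack(logvars))        # [num_experts, batch, dim]
    total_precision = precisions.sum(dim=0)
    mean = (torch.stack(means) * precisions).sum(dim=0) / total_precision
    return mean, -torch.log(total_precision)             # combined mean and log-variance

# A prior expert N(0, I) covers the empty conditioning set; each available
# modality (text, sketch, segmentation, style, ...) contributes one more expert.
prior_mean, prior_logvar = torch.zeros(4, 16), torch.zeros(4, 16)
text_mean, text_logvar = torch.randn(4, 16), torch.randn(4, 16)
mean, logvar = product_of_gaussians([prior_mean, text_mean],
                                    [prior_logvar, text_logvar])
```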


About the speaker: Ming-Yu Liu is a distinguished research scientist and a director of research at NVIDIA. His research focuses on deep generative models and their applications, and his ambition is to enable machine super-human imagination capability so that machines can better assist people in creating content and expressing themselves. Liu likes to put his research into people’s daily lives --- and NVIDIA Canvas/GauGAN and NVIDIA Maxine are two products enabled by his research. He also strives to make the research community better and frequently serves as an area chair for various top-tier AI conferences, including NeurIPS, ICML, ICLR, CVPR, ICCV, and ECCV, as well as organizing tutorials and workshops at these conferences. Empowered by many, Liu has won several major awards in his field, including winning a SIGGRAPH Best in Show award twice. Prior to NVIDIA, he was a Principal Research Scientist with Mitsubishi Electric Research Laboratories (MERL). He received his Ph.D. degree from the University of Maryland, College Park, MD, USA, in 2012, advised by Prof. Rama Chellappa.



Title: Looking at a Few Images of Rooms and Many Interacting Hands

Speaker: David Fouhey

Time: Nov 2nd, 12pm ET, 2021


Abstract: The long-term goal of my research is to let computers understand the physical world from images, including both 3D properties and how humans or robots could interact with things. This talk will summarize two recent directions aimed at enabling this goal.


I will begin by talking about 3D reconstruction from two ordinary images where the camera pose is unknown and the views have little overlap -- think hotel listings. Computers struggle in this setting since standard techniques usually depend on many images, high overlap, known camera poses, or RGBD input. Nonetheless, humans seem quite adept at building a sense of a scene from a few photos. We think the key to this ability is joint reasoning over reconstruction, camera pose, and correspondence. This insight is put into action with a deep learning architecture and optimization that produces a coherent planar reconstruction. Our system outperforms many baselines on Matterport3D, but there is plenty of room for new work in this exciting setting.


Then, I will focus on understanding what humans are doing with their hands. Hands are a primary means for humans to manipulate the world, but fairly basic information about what they're doing is often off limits to computers (or, at least in challenging data). I'll describe some of our efforts on understanding hand state, including work on learning to segment hands and hand-held objects in images via a system that learns from large-scale video data.


Bio: David Fouhey is an assistant professor in the University of Michigan EECS department. He received a PhD in robotics in 2016 from CMU where he was an NSF and NDSEG fellow, then was a postdoctoral fellow at UC Berkeley. He has spent time at Oxford's Visual Geometry Group, INRIA Paris, and Microsoft Research. More information about him can be found here: http://web.eecs.umich.edu/~fouhey/ .


Deep Surface Meshes

Speaker: Pascal Fua

Time: Oct 26th, 3pm ET, 2021


Abstract: Geometric Deep Learning has recently made striking progress with the advent of Deep Implicit Fields (SDFs). They allow for detailed modeling of watertight surfaces of arbitrary topology while not relying on a 3D Euclidean grid, resulting in a learnable 3D surface parameterization that is not limited in resolution. Unfortunately, they have not yet reached their full potential for applications that require an explicit surface representation in terms of vertices and facets, because converting the SDF to such a 3D mesh representation requires a marching-cubes algorithm, whose output cannot be easily differentiated with respect to the SDF parameters.
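
A small example of where the differentiability issue sits: the position of a marching-cubes vertex along a grid edge is a smooth function of the two SDF samples, but the choice of which edges receive vertices (the cube case) is discrete. The snippet below only illustrates the smooth part and is not the speaker's method.

```python
import torch

def edge_vertex(p0, p1, s0, s1):
    """Zero-crossing of a linear SDF interpolant along the edge p0 -> p1."""
    t = s0 / (s0 - s1)              # differentiable in s0 and s1 (when signs differ)
    return p0 + t * (p1 - p0)

p0, p1 = torch.tensor([0.0, 0.0, 0.0]), torch.tensor([1.0, 0.0, 0.0])
s0 = torch.tensor(-0.2, requires_grad=True)   # SDF value at p0 (inside)
s1 = torch.tensor(0.6, requires_grad=True)    # SDF value at p1 (outside)
v = edge_vertex(p0, p1, s0, s1)
v[0].backward()                               # gradients flow back to the SDF samples
print(v, s0.grad, s1.grad)
```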


In this talk, I will discuss our approach to overcoming this limitation and implementing convolutional neural nets that output complex 3D surface meshes while remaining fully differentiable and end-to-end trainable. I will also present applications to single-view reconstruction, physically-driven shape optimization, and bio-medical image segmentation.



Bio: Pascal Fua received an engineering degree from Ecole Polytechnique, Paris, in 1984 and a Ph.D. in Computer Science from the University of Orsay in 1989. He joined EPFL (Swiss Federal Institute of Technology) in 1996, where he is a Professor in the School of Computer and Communication Science and head of the Computer Vision Lab. Before that, he worked at SRI International and at INRIA Sophia-Antipolis as a Computer Scientist. His research interests include shape modeling and motion recovery from images, analysis of microscopy images, and Augmented Reality. He has (co)authored over 300 publications in refereed journals and conferences. He has received several ERC grants. He is an IEEE Fellow and has been an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He often serves as a program committee member, area chair, and program chair of major vision conferences and has cofounded three spinoff companies.


Video Understanding Beyond Input/Output Learning: Supervision, Multi-modality, and Attention.

Speaker: Gunnar Sigurdsson

Time: Oct 19th, 12pm ET, 2021


Abstract: Over the course of my PhD I worked towards building datasets and models that had an increasingly realistic view into people's daily lives. I will discuss this work in the context of which forms of supervision, modality, and attention I think will help us understand temporal visual data (videos). I'll discuss the evolution of the Charades dataset and recent related datasets. Setting videos up as a standard input/output learning problem can be challenging, and I want to discuss our work on using a shared modality as supervision, without any labels. Finally, I'll discuss our work on moving beyond stationary videos by modeling eye-like movements to improve video understanding.


Bio: Gunnar Sigurdsson helps computers understand people and their activities from video data. Gunnar worked with Abhinav Gupta at Carnegie Mellon University on training AI to understand people, and has collaborated with research labs including Flickr/Yahoo! Labs, the Allen Institute for Artificial Intelligence, INRIA, and DeepMind. Gunnar authored the Charades dataset, designed for classification and detection of daily human activities in continuous videos, and the Charades-Ego dataset for first-person video reasoning. (www.gunnar.xyz)





Optimization Inspired Neural Networks for Multiview 3D

Date: Oct 5th, 12pm ET

Speaker: Zachary Teed


Abstract: Multiview 3D has traditionally been approached as an optimization problem. The solution is produced by an algorithm which searches over continuous variables (camera pose, depth, 3D points) to satisfy both geometric constraints and visual observations. In contrast, deep learning offers an alternative strategy where the solution is produced by a general-purpose network with learned weights. In this talk, I will be discussing a hybrid approach for multiview problems, where we explore neural architecture designs inspired by optimization. We’ve used this general strategy to develop accurate and robust systems for optical flow, stereo, scene flow, and visual SLAM.


Bio: Zachary Teed is a 4th-year PhD student at Princeton University. He is a member of the Princeton Vision and Learning Lab and is advised by Professor Jia Deng. His research focuses on problems in multiview perception including optical flow, stereo, scene flow, and visual SLAM. Previously, Zachary graduated from Washington University in St. Louis with a B.S. in computer science. He has received several awards including the Qualcomm Innovation Fellowship, the Jacobus Fellowship, and the ECCV 2020 Best Paper Award.




How to compute locally invertible maps 

Date: March 2nd, 12pm ET

Speaker: Dmitry Sokolov


Abstract: Mapping a triangulated surface to the 2D plane (or a tetrahedral mesh to 3D space) is the most fundamental problem in geometry processing. The critical property of a good map is (local) invertibility, and it is not an easy one to obtain. We propose a mapping method inspired by the mesh untangling problem. In computational physics, untangling plays an important role in mesh generation: it takes a mesh as input and moves the vertices to get rid of foldovers. In fact, mesh untangling can be considered as a special case of mapping, where the geometry of the object is to be defined in the map space and the geometric domain is not explicit, supposing that each element is regular. This approach allows us to produce locally invertible maps, which is the major challenge of mapping. In practice, our method succeeds in very difficult settings, and with less distortion than previous work, both in 2D and 3D.
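
As a concrete reading of "local invertibility", the snippet below checks the per-triangle Jacobian determinant of a piecewise-linear 2D map: a foldover shows up as a non-positive determinant. This is only the diagnostic, not the untangling optimization itself, and the triangles are a made-up example.

```python
import numpy as np

def triangle_jacobian_dets(ref_pts, map_pts, triangles):
    """det(J) per triangle for a piecewise-linear map from ref_pts to map_pts."""
    dets = []
    for i, j, k in triangles:
        R = np.column_stack([ref_pts[j] - ref_pts[i], ref_pts[k] - ref_pts[i]])
        M = np.column_stack([map_pts[j] - map_pts[i], map_pts[k] - map_pts[i]])
        J = M @ np.linalg.inv(R)            # affine Jacobian on this triangle
        dets.append(np.linalg.det(J))
    return np.array(dets)

# A flipped (folded-over) triangle shows up as a negative determinant:
ref = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
good = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
bad = np.array([[0.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])    # reflected copy
print(triangle_jacobian_dets(ref, good, [(0, 1, 2)]))    # positive
print(triangle_jacobian_dets(ref, bad, [(0, 1, 2)]))     # negative -> foldover
```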


Bio: Dmitry Sokolov is an associate professor at the University of Lorraine and is the head of the research team Pixel ( https://pixel.inria.fr ).




On Reflection

Speaker: Noah Snavely

Date: May 4th, 12pm ET, 2021

Link: https://mit.zoom.us/j/98209522697


Abstract: How can you tell if you are in a mirror world? Let's talk about the slightly askew universe of reflected images.


Bio: Noah Snavely is an associate professor of Computer Science at Cornell University and Cornell Tech, and also a researcher at Google Research. Noah's research interests are in computer vision and graphics, in particular 3D understanding and depiction of scenes from images. Noah is the recipient of a PECASE, a Microsoft New Faculty Fellowship, an Alfred P. Sloan Fellowship, a SIGGRAPH Significant New Researcher Award, and a Helmholtz Prize.


Multi-modal self-supervised learning from videos

Speaker: Adrià Recasens

Time: Tuesday, April 27th. 12pm-1pm ET


Abstract:

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. In the first part of the talk, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Furthermore, in the second part of the talk we introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. We demonstrate that MMV and BraVe achieve state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.
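
A compact sketch of the kind of cross-modal objective such networks are trained with: embeddings of the visual and audio streams from the same clip are pulled together against the other clips in the batch (an NCE-style loss). The temperature, projection heads, text branch, and BraVe's predictive variant are omitted, so this is not the exact MMV/BraVe objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Embeddings from the same clip are positives; other clips in the batch are negatives."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature           # similarity of every video/audio pair
    targets = torch.arange(v.shape[0])         # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

video_emb = torch.randn(32, 512)               # one embedding per clip, per modality
audio_emb = torch.randn(32, 512)
loss = contrastive_loss(video_emb, audio_emb)
```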



Bio: Adria Recasens is a Research Scientist at DeepMind. He previously completed his PhD on computer vision at the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology in 2019. During his PhD, he worked on various topics related to image and video understanding; in particular, he has various publications on gaze estimation in images and videos. His current research focuses on self-supervised learning, specifically applied to multiple modalities such as video, audio or text.



A Paradigm Shift for Perception and Prediction Pipeline in Autonomous Driving

Speaker: Xinshuo Weng

Time: Tuesday, April 20th, 12pm ET


Abstract:

The perception and prediction pipeline (3D object detection, multi-object tracking, and trajectory forecasting) is a key component of autonomous systems such as self-driving cars. Although significant advancements have been achieved in each individual module of this pipeline, limited attention has been paid to improving the pipeline itself. In this talk, I will introduce an alternative to this standard perception and prediction pipeline. In contrast to the standard pipeline, this new pipeline first forecasts LiDAR point clouds using a standard LSTM autoencoder. Then, detection and tracking are performed on the predicted point clouds to obtain future object trajectories. As forecasting the LiDAR sensor data does not require object labels for training, we can scale the performance of the Sequential Pointcloud Forecasting (SPF) module in this pipeline. To further improve the SPF module, I will talk about a few techniques that can produce significantly more fine-grained details and predict stochastic futures of LiDAR point clouds.


Bio:

Xinshuo Weng is a Ph.D. student at the Robotics Institute of Carnegie Mellon University (CMU) advised by Kris Kitani. She received her master's degree at CMU, where she worked with Yaser Sheikh and Kris Kitani. Prior to CMU, she worked at Facebook Reality Lab as a research engineer to help build “Photorealistic Telepresence”. Her recent research interest lies in 3D computer vision and graph neural networks for autonomous systems. She has developed 3D multi-object tracking systems such as AB3DMOT, which has received >1,000 stars on GitHub. She is also leading autonomous driving workshops at major conferences such as NeurIPS 2020, IJCAI 2021, and ICCV 2021. She was awarded a Qualcomm Innovation Fellowship for 2020-2021.





Unsupervised Exemplar Representations: Beyond Task-based Optimization

Speaker: Aayush Bansal

Time: April 6th, 12pm ET


Abstract: We are tuned to think in terms of "tasks" in computer vision, graphics, machine learning, and robotics. Given a problem, we collect a large amount of domain-specific and task-specific data. We then get appropriate human annotations on massive datasets to enable highly-tuned optimizations based on supervised learning. I believe this setup is prohibitive and has hindered progress in the community. In this talk, I will present two unsupervised approaches that build on a simple autoencoder. In the first part of the talk, I will talk about Exemplar Autoencoders, which are trained on an individual's voice and yet generalize to unknown voices in different languages. I will demonstrate their application as an assistive tool for the speech-impaired and for multi-lingual translation. In the second part of the talk, I will talk about Video-Specific Autoencoders, which are trained on random frames of a video without temporal information via a simple reconstruction loss. The learned representation allows us to do a wide variety of video analytic tasks such as (but not limited to) spatial and temporal super-resolution, spatial and temporal editing, video textures, average video exploration, and correspondence estimation within and across videos. Neither approach optimizes for the end task, yet both compete with state-of-the-art supervised methods that do task-specific optimization.


Bio:  Aayush Bansal received his Ph.D. in Robotics from Carnegie Mellon University under the supervision of Prof. Deva Ramanan and Prof. Yaser Sheikh. He was a Presidential Fellow at CMU, and a recipient of the Uber Presidential Fellowship (2016-17), Qualcomm Fellowship (2017-18), and Snap Fellowship (2019-20). His research has been covered by various national and international media such as NBC, CBS, WQED, 90.5 WESA FM, France TV, and Journalist. He has also worked with production houses such as  BBC Studios, Full Frontal with Samantha Bee (TBS), etc. More details are available on his webpage: http://www.cs.cmu.edu/~aayushb/





Video Understanding with Modern Language Models

Speaker: Gedas Bertasius

Time: Tuesday, March 30th 2021. 12pm ET.


Abstract: 

Humans understand the world by processing signals from both vision and language. Similarly, we believe that language can be useful for developing better video understanding systems. In this talk, I will present several video understanding frameworks that incorporate models from the language domain.

 

First, I will introduce TimeSformer, the first convolution-free architecture for video modeling built exclusively with self-attention. It achieves the best reported numbers on major action recognition benchmarks, and it is also more efficient than the state-of-the-art 3D CNNs. Afterwards, I will present COBE, a new large-scale framework for learning contextualized object representations in settings involving human-object interactions. Our approach exploits automatically-transcribed speech narrations from instructional YouTube videos, and it does not require manual annotations. Lastly, I will introduce a multi-modal video-based text generation framework Vx2Text, which outperforms state-of-the-art on three video based text-generation tasks: captioning, question answering and dialoguing.


Bio:

Gedas Bertasius is a postdoctoral researcher at Facebook AI working on computer vision and machine learning problems. His current research focuses on topics of video understanding, first-person vision, and multi-modal deep learning. He received his Bachelor's degree in Computer Science from Dartmouth College, and a Ph.D. in Computer Science from the University of Pennsylvania. His recent work was nominated for the CVPR 2020 Best Paper Award.


Self-Supervised 3D Digitization of Faces

Speaker: Ayush Tewari

Time: Tuesday, March 9th 2021. 12pm ET.


Abstract: Photorealistic and semantically controllable digital models of human faces are important for a wide range of applications in movies, virtual reality, and casual photography. Recent approaches have explored digitizing faces from a single image using priors commonly known as 3D morphable models (3DMMs). In this talk, I will discuss methods for high-quality monocular 3D reconstruction by learning 3DMMs from 2D data such as videos. Learning from 2D data allows for better generalization compared to training on limited 3D scans. I will also talk about methods for photorealistic editing of portrait images using these 3D models. Our method learns a mapping between the latent spaces of a 3DMM and StyleGAN, enabling semantically meaningful control over portrait images. Through these methods, I will demonstrate that a single 3D face scan, combined with image and video datasets, can enable disentangled editing of the head pose, scene illumination, and facial expressions in portrait images at photorealistic quality.


Bio : Ayush Tewari is a Ph.D. student working with Prof. Christian Theobalt at the Max Planck Institute for Informatics in Saarbruecken, Germany. He received his M.Sc. in Computer Science from Grenoble INP, and B.Tech. in Computer Science and Engineering from IIIT Hyderabad. His research interests are in computer vision, computer graphics, and machine learning, with a focus on self-supervised 3D reconstruction and synthesis problems.







On novel and perpetual view synthesis

Date: February 23rd, 12pm ET

Speaker: Angjoo Kanazawa


Abstract: 2020 was a turbulent year, but for 3D learning it was a fruitful one, with lots of exciting new tools and ideas. In particular, there have been many exciting developments in the area of coordinate-based neural networks and novel view synthesis. In this talk I will discuss our recent work on single-image view synthesis with pixelNeRF, which aims to predict a Neural Radiance Field (NeRF) from a single image. I will discuss how the NeRF representation allows models like pixel-aligned implicit functions (PIFu) to be trained without explicit 3D supervision, and the importance of other key design factors such as predicting in the view coordinate frame and handling multi-view inputs. Then, I will discuss Infinite Nature, a project in collaboration with teams at Google NYC, where we explore how to push the boundaries of novel view synthesis and generate views far beyond the edges of the initial input image, resulting in controllable video generation of a natural scene.


Bio: Angjoo Kanazawa is an Assistant Professor in the Department of Electrical Engineering and Computer Science at the University of California at Berkeley. Previously, she was a BAIR postdoc at UC Berkeley advised by Jitendra Malik, Alexei A. Efros and Trevor Darrell. She completed her PhD in CS at the University of Maryland, College Park with her advisor David Jacobs. Prior to UMD, she obtained her BA in Mathematics and Computer Science at New York University. She has also spent time at the Max Planck Institute for Intelligent Systems with Michael Black and Google NYC with Noah Snavely. Her research is at the intersection of computer vision, graphics, and machine learning, focusing on 4D reconstruction of the dynamic world behind everyday photographs and video. 






Learning the Predictability of the Future

Date: February 16th, 2021

Speaker: Didac Suris


Abstract:

Not everything in the future is predictable. We cannot anticipate the outcomes of coin flips, and we cannot forecast the exact trajectory of a person walking. Selecting what to predict is therefore a central issue for future prediction. In this talk, I will introduce a framework for learning from unlabeled video what is predictable in the future. Instead of committing up front to features to predict, our approach learns from data which features are predictable. Based on the observation that hyperbolic geometry naturally and compactly encodes hierarchical structure, we propose a predictive model in hyperbolic space. When the model is most confident, it will predict at a concrete level of the hierarchy, but when the model is not confident, it learns to automatically select a higher level of abstraction. Although our representation is trained with unlabeled video, visualizations show that action hierarchies emerge in the representation.
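
For readers unfamiliar with hyperbolic representations, the Poincaré-ball distance below is the kind of geometry the abstract refers to: the same Euclidean gap costs far more near the boundary (concrete predictions) than near the origin (abstract ones). The snippet is illustrative, not the paper's code.

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """Geodesic distance between points inside the unit ball (last dim = features)."""
    sq = torch.sum((u - v) ** 2, dim=-1)
    nu = torch.clamp(1.0 - torch.sum(u * u, dim=-1), min=eps)
    nv = torch.clamp(1.0 - torch.sum(v * v, dim=-1), min=eps)
    return torch.acosh(1.0 + 2.0 * sq / (nu * nv))

# The same Euclidean gap costs more near the boundary than near the origin:
a = poincare_distance(torch.tensor([0.0, 0.0]), torch.tensor([0.1, 0.0]))
b = poincare_distance(torch.tensor([0.8, 0.0]), torch.tensor([0.9, 0.0]))
print(a.item(), b.item())   # b is several times larger than a
```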


Bio:

Dídac Surís is a second-year PhD student in computer vision at Columbia University, advised by Prof. Carl Vondrick. He obtained his BS and MS in telecommunications at the Polytechnic University of Catalonia, in Barcelona. Before starting his PhD, he interned with the research team at Telefonica, and he conducted research stays at MIT with Prof. Antonio Torralba and at University of Toronto with Prof. Sanja Fidler. His research interests include multimodal machine learning, video prediction, and self-supervised representation learning. More generally, he is interested in the areas of artificial intelligence that exploit all the available information using as little human supervision as possible.





Compositional Reasoning in Robot Learning

Date: February 2nd, 2021

Speaker: Danfei Xu


Abstract: To develop robots that can operate in everyday human environments, we must enable the robots to generalize beyond their prior knowledge. While today's data-driven robot learning approaches can learn from sensorimotor data with minimal assumptions about the environment, they are often limited to learning one task at a time, starting from scratch. At the same time, humans perform everyday tasks with ease. But instead of learning each task in isolation, we distill reusable abstractions from our daily experiences. This allows us to quickly solve new tasks by composing known building blocks. Such compositional reasoning capability is crucial for developing future robots that are both competent and flexible.

In this talk, I will present my work on building compositional reasoning capabilities into sensorimotor learning systems. I will start by presenting a core set of works that impose strong structural priors on learning algorithms, and show that they can enable efficient learning and systematic generalization across complex, long-horizon manipulation tasks. Then I will present some of our recent efforts on relaxing the assumptions in these methods to bring the idea of compositional reasoning to real-world settings.


Bio: Danfei Xu is a Ph.D. student at Stanford University advised by Professor Silvio Savarese and Professor Fei-Fei Li. His research lies in the intersection of robotics, computer vision, and machine learning. His research goal is to build autonomous agents that can operate in everyday human environments. He obtained a B.S. from Columbia University in 2015 and has spent time at CMU Robotics Institute, Columbia Robotics Lab, Autodesk Research, Zoox, and DeepMind.




Building differentiable models of the 3D world

Speaker: Krishna Murthy

Time: Tuesday, January 19th. 1pm ET


Abstract:

Modern machine learning has created exciting new opportunities for the design of intelligent scene understanding systems. In particular, gradient-based learning methods have tremendously improved 3D scene understanding in terms of perception, reasoning, and action. However, these advancements have undermined many "classical" techniques developed over the last few decades. I postulate that a flexible blend of "classical" and learned methods is the most promising path to developing flexible, interpretable, and actionable models of the world: a necessity for intelligent embodied agents.


While modern learning-based scene understanding systems have shown experimentally promising results in simulated scenarios, they fail in unpredictable and unintuitive ways when deployed in real-world applications. Classical systems, on the other hand, offer guarantees and bounds on performance and generalization, but often require heavy handcrafting and oversight. My research aims to deeply integrate classical and learning-based techniques to bring the best of both worlds, by building "differentiable models of the 3D world". I will talk about two particular recent efforts along these directions.

1. gradSLAM - a fully differentiable dense SLAM system that can be plugged as a "layer" into neural nets

2. gradSim - a differentiable simulator comprising a physics engine and a renderer, to enable physical parameter estimation and visuomotor control from video


Bio:

Krishna Murthy is a PhD candidate at the Robotics and Embodied AI lab and Mila at the University of Montreal, working on differentiable adaptations of physical processes (computer vision, robotics, graphics, physics, and optimization) and their applicability in modern learning pipelines. He organized the "differentiable vision, graphics, and physics" workshop at NeurIPS 2020 and is organizing the "rethinking ML papers" workshop at ICLR 2021. His work has been recognized with an NVIDIA graduate fellowship (2021) and a best paper award from Robotics and Automation Letters in 2019. He was also chosen for the RSS Pioneers cohort in 2020.



Surprises in the quest for robustness in ML 

Speaker: Aditi Raghunathan

Date: Tuesday,  January 12, 12pm ET


Abstract: Standard machine learning produces models that are highly accurate on average but that degrade dramatically when the test distribution deviates from the training distribution. While one can train robust models, this often comes at the expense of standard accuracy (on the training distribution). We study this tradeoff in two settings, adversarial examples and minority groups, creating simple examples which highlight generalization issues as a major source of this tradeoff. For adversarial examples, we show that even augmenting with correctly annotated data to promote robustness can produce less accurate models, but we develop a simple method, robust self-training, that mitigates this tradeoff using unlabeled data. For minority groups, we show that overparametrization of models can hurt accuracy on the minority groups, though it improves standard accuracy. These results suggest that the "more data" and "bigger models" strategy that works well for the standard setting, where train and test distributions are close, need not work in out-of-domain settings.


Bio: Aditi Raghunathan is a fifth year PhD student at Stanford University working with Percy Liang. She is interested in building robust ML systems that can be deployed in the wild. She is a recipient of the Open Philanthropy AI Fellowship and the Google PhD fellowship in Machine Learning.


Title: Machine Learning for Deep Image Manipulation

Speaker: Taesung Park

Date: December 8th, 12PM ET


Abstract: Deep generative models, such as Generative Adversarial Networks (GANs) in particular, can sample realistic images from Gaussian noise. However, are they good for image editing? Image editing requires the output to retain some resemblance to the user-provided input image. In this talk, I will discuss a different formulation in which the generator network is trained to transform one image to another. I will explore some interesting ways to constrain the generator to respect the input images, and show that they are indeed useful for image editing and other practical tasks.


Bio: Taesung Park is a Ph.D. student at UC Berkeley, advised by Prof. Alexei Efros, focusing on computer vision and learning-based computational photography. He worked on several projects related to image synthesis, including CycleGAN (co-first author, 5000+ citations) and GauGAN (Best Paper Finalist at CVPR19 and Best in Show Award at SIGGRAPH19 Real-Time Live). He received a B.S. in Math and M.S. in Computer Science at Stanford, working with Vladlen Koltun and Sergey Levine. He is a recipient of Samsung Scholarship and Adobe Research Fellowship 2020.




Title: Learning to Create and Label Data

Speaker: Amlan Kar

Time: Tuesday December 1st, 12pm ET


Abstract: Labelled data is the workhorse that has driven exponential growth in the field of machine learning, and common wisdom now posits that better data usually leads to a better downstream model. In this talk, I will present and discuss explorations into how learnt models can be used to 1) improve interactive labeling of data with human labelers and 2) automatically create labelled data by learning to simulate scenes inside graphics engines. First, I will talk about and demo the Toronto Annotation Suite, a platform for data annotation and management built with various learnt methods from our group at its core. Next, I will present our work on Meta-Sim (ICCV 2019) and Meta-Sim2 (ECCV 2020) on generative models of simulated 3D scenes in graphics engines. Finally, time permitting, I will talk about Fed-Sim (MICCAI 2020), which extends this framework to the medical domain, allowing simulation and generation of synthetic CT volumes by modeling human organs and the CT rendering process.


Bio: Amlan is a graduate student at the University of Toronto advised by Prof. Sanja Fidler, and a Research Scientist at NVIDIA Research. He is broadly interested in Computer Vision problems and their applications, particularly in how labelled data can be collected quicker, or generated automatically. He graduated from IIT Kanpur, India in 2017 with a bachelor’s degree where he worked with Prof. Gaurav Sharma on Action Recognition and earlier with Prof. Amitabha Mukerjee. He has previously interned with the research team at Fyusion Inc. and with Prof. Raquel Urtasun and Prof. Sanja Fidler at the University of Toronto.





Title: On the Capability of CNNs to Generalize to Unseen Category-Viewpoint Combinations

Time: Tuesday November 3rd, 12pm ET

Speaker: Spandan Madan



Abstract:

Humans can effortlessly recognize objects from previously unseen viewpoints. However, recent work suggests that CNNs fail to generalize to category-viewpoint combinations not seen during training. It is unclear when and how such generalization may be possible---Does the number of combinations seen during training impact generalization? What architectures better enable generalization in the multi-task setting of simultaneous category and viewpoint classification? Furthermore, what are the underlying mechanisms that drive the network's generalization behaviour?

In this presentation, we present our findings on these questions, obtained by analyzing state-of-the-art CNNs trained for simultaneous object classification and 3D viewpoint estimation, with quantitative control over the number of category-viewpoint combinations seen during training. We also investigate the emergence of two types of specialized neurons that can explain generalization to unseen combinations—neurons selective to category and invariant to viewpoint, and vice versa. We present our analysis on multiple network backbones and datasets, including MNIST extended with position or scale, the iLab dataset with vehicles at different viewpoints, and a challenging new dataset for car model recognition and viewpoint estimation that we introduce in this work: the Biased-Cars dataset.


Bio: 

Spandan is a PhD student at Harvard SEAS advised by Hanspeter Pfister and closely collaborating with the Center for Brains, Minds and Machines (CBMM) at MIT. His research focuses on building controlled environments and tools for better understanding and debugging computer vision models. Before this, Spandan completed his M.E. in Computational Science and Engineering at Harvard, and his undergrad at IIT Delhi (India). He was a recipient of the Snap Research Scholarship in 2018, and a Harvard SEAS fellowship in 2017. Spandan has also worked as a visiting research assistant at MIT, and as a research intern at Microsoft Research and Adobe Research in the past.





Title: Inverting the Latent Space of GANs for Real Image Editing

Speaker: Bolei Zhou

Date: Tuesday, October 27th, 12pm ET

Abstract:

Recent progress in deep generative models such as Generative Adversarial Networks (GANs) has enabled synthesizing photo-realistic images, such as faces and scenes. However, what has been learned inside the deep representations that arise from synthesizing images remains much less explored. In this talk, I will present our recent series of works from GenForce (https://genforce.github.io/) on interpreting and utilizing the latent space of GANs. Identifying these semantics not only allows us to better understand the internals of generative models, but also facilitates versatile real image editing applications. Lastly, I will briefly talk about our recent effort of using generative modeling to improve the generalization of end-to-end autonomous driving.

Bio:

Bolei Zhou is an Assistant Professor with the Information Engineering Department at the Chinese University of Hong Kong. He received his PhD in computer science at the Massachusetts Institute of Technology. His research is on machine perception and decision making, with a focus on enabling interpretable human-AI interactions. He received the MIT Tech Review’s Innovators under 35 in Asia-Pacific Award, Facebook Fellowship, Microsoft Research Asia Fellowship, MIT Greater China Fellowship, and his research was featured in media outlets such as TechCrunch, Quartz, and MIT News. More about his research is at http://bzhou.ie.cuhk.edu.hk/.



Title: Understanding and Extending Neural Radiance Fields

Speaker: Jon Barron

Time: Tuesday, October 13th, 12pm ET

Zoom: https://mit.zoom.us/j/93872275009



Abstract: Neural Radiance Fields (Mildenhall, Srinivasan, Tancik, et al., ECCV 2020) are an effective and simple technique for synthesizing photorealistic novel views of complex scenes by optimizing an underlying continuous volumetric radiance field, parameterized by a (non-convolutional) neural network. I will discuss and review NeRF and then introduce two works that closely relate to it: First, I will explain why NeRF (and other CPPN-like architectures that map from low-dimensional coordinates to intensities) depend critically on the use of a trigonometric "positional encoding", aided by insights provided by the neural tangent kernel literature. Second, I will show how NeRF can be extended to incorporate explicit reasoning about occluders and appearance variation, and can thereby enable photorealistic view synthesis and photometric manipulation using only unstructured image collections.
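
A small sketch of the trigonometric positional encoding mentioned in the abstract: low-dimensional coordinates are lifted to a bank of sine/cosine features before entering the MLP. The frequency count and scaling follow the common convention and are assumptions here, not the authors' code.

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """Map coordinates (last dim) to [x, sin(2^k * pi * x), cos(2^k * pi * x)] features."""
    feats = [x]
    for k in range(num_freqs):
        for fn in (torch.sin, torch.cos):
            feats.append(fn((2.0 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)

points = torch.rand(1024, 3) * 2 - 1           # 3D positions scaled to [-1, 1]
encoded = positional_encoding(points)          # shape: (1024, 3 + 3 * 2 * 10)
```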



Bio: Jon Barron is a staff research scientist at Google, where he works on computer vision and machine learning. He received a PhD in Computer Science from the University of California, Berkeley in 2013, where he was advised by Jitendra Malik, and he received an Honours BSc in Computer Science from the University of Toronto in 2007. He received a National Science Foundation Graduate Research Fellowship in 2009, the C.V. Ramamoorthy Distinguished Research Award in 2013, the PAMI Young Researcher Award in 2020, and the ECCV Best Paper Honorable Mention in both 2016 and 2020.



3D Vision with 3D View-Predictive Neural Scene representations

Speaker: Katerina Fragkiadaki

Date: Tuesday, September 29th. 12pm - 1pm

Link: https://mit.zoom.us/j/98605309245


Abstract: Current state-of-the-art CNNs localize rare object categories in internet photos, yet they miss basic facts that a two-year-old has mastered: that objects have 3D extent, that they persist over time despite changes in the camera view, that they do not intersect in 3D, and others. We will discuss models that learn to map 2D and 2.5D images and videos into amodal, completed 3D feature maps of the scene and the objects in it by predicting views. We will show that the proposed models learn object permanence, have objects emerge in 3D without human annotations, support grounding of language in 3D visual simulations, and learn intuitive physics that generalize across scene arrangements and camera configurations. In this way, they overcome many limitations of 2D CNNs for video perception, model learning and language grounding.


Bio: Katerina Fragkiadaki is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University. She received her Ph.D. from the University of Pennsylvania and was subsequently a postdoctoral fellow at UC Berkeley and Google Research. Her work is on learning visual representations with little supervision and on incorporating spatial reasoning into deep visual learning. Her group develops algorithms for mobile computer vision and for learning physics and common sense for agents that move around and interact with the world. Her work has been recognized with a best Ph.D. thesis award, an NSF CAREER award, and Google and Sony faculty research awards.



Title: Towards continual and compositional few-shot learning

Speaker: Mengye Ren

Time: Tuesday, 09/22 12pm ET

Zoom: https://mit.zoom.us/j/91553306682


Abstract: Few-shot learning has recently emerged as a popular area of research towards building more flexible machine learning programs that can adapt at test time. However, it now faces two major criticisms. First, the “k-shot n-way” episodic structure is still far from modelling the incremental knowledge acquisition of an agent in a natural environment. Second, there has been limited improvement towards modelling compositional understanding of novel objects; on the other hand, features obtained from regular classification tasks can perform very well. In this talk I will introduce two recent advances that address each of these challenges.
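For context on the “k-shot n-way” episodic structure criticized above, here is a minimal sketch of how one such episode is typically sampled; the helper names and the use of plain Python lists are illustrative assumptions, not code from the speaker's work.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample one 'k-shot n-way' episode: n_way classes, with k_shot support
    examples and n_query query examples per class. `dataset` is assumed to
    be a list of (example, label) pairs."""
    by_class = defaultdict(list)
    for example, label in dataset:
        by_class[label].append(example)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```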


Bio: Mengye Ren is a PhD student in the machine learning group of the Department of Computer Science at the University of Toronto. His academic advisor is Prof. Richard Zemel. He also works as a research scientist at Uber Advanced Technologies Group (ATG) Toronto, directed by Prof. Raquel Urtasun, doing self-driving-related research. His research interests are meta-learning, few-shot learning, and continual learning. He completed his undergraduate degree in Engineering Science at the University of Toronto.





Title: AI for Self-Driving at Scale

Speaker: Raquel Urtasun

Time: Wednesday, September 16. 2pm ET.



Abstract: We are on the verge of a new era in which robotics and artificial intelligence will play an important role in our daily lives. Self-driving vehicles have the potential to redefine transportation as we understand it today. Our roads will become safer and less congested, while parking spots will be repurposed as leisure zones and parks. However, many technological challenges remain as we pursue this future.

In this talk I will showcase the latest advancements made by Uber Advanced Technologies Group’s Research Lab in the quest towards self-driving vehicles.


Bio: Raquel Urtasun is the Chief Scientist of Uber ATG and the Head of Uber ATG Toronto. She is also a Professor in the Department of Computer Science at the University of Toronto, a Canada Research Chair in Machine Learning and Computer Vision, and a co-founder of the Vector Institute for AI. She received her Ph.D. degree from the Computer Science department at École Polytechnique Fédérale de Lausanne (EPFL) in 2006 and did her postdoc at MIT and UC Berkeley. She is a world-leading expert in AI for self-driving cars. Her research interests include machine learning, computer vision, robotics, and remote sensing. Her lab was selected as an NVIDIA NVAIL lab. She is a recipient of an NSERC EWR Steacie Award, an NVIDIA Pioneers of AI Award, a Ministry of Education and Innovation Early Researcher Award, three Google Faculty Research Awards, an Amazon Faculty Research Award, a Connaught New Researcher Award, a Fallona Family Research Award, and two Best Paper Runner-Up Prizes awarded at CVPR in 2013 and 2017, respectively. She was also named Chatelaine's 2018 Woman of the Year and one of Toronto's top influencers of 2018 by Adweek magazine.




Title: Learning Semantic Models from Video with Zero or Few Labels

Speaker: Lorenzo Torresani

Time: Tuesday, September 1st. 12pm ET

Abstract: In this talk I will present our recent work on learning semantic models from video with no or limited supervision.

I will begin by describing a self-supervised model that leverages cross-modal clustering of audio and video. As an unsupervised pre-training strategy, it yields action recognition accuracy superior to that achieved with fully-supervised pre-training on the large-scale Kinetics dataset.

In the second part of my talk I will present a method that extends object detectors beyond traditional categorization to recognize contextual information, such as the object state, the action applied to it, or accompanying objects near it. The approach exploits automatically-transcribed narrations from instructional videos, and it does not require any manual annotations.

Finally, I will conclude by introducing MaskProp, a unifying approach for classifying, segmenting and tracking object instances in video. It achieves the best reported accuracy on the YouTube-VIS dataset, outperforming the closest competitor despite being trained on 1000x fewer images and 10x fewer bounding boxes. 



Bio: Lorenzo Torresani is a Professor in the Computer Science Department at Dartmouth College and a Research Scientist at Facebook AI. He received a Laurea Degree in Computer Science with summa cum laude honors from the University of Milan (Italy) in 1996, and an M.S. and a Ph.D. in Computer Science from Stanford University in 2001 and 2005, respectively. In the past, he has worked at several industrial research labs including Microsoft Research Cambridge, Like.com and Digital Persona. His research interests are in computer vision and deep learning. He is the recipient of several awards, including a CVPR best student paper prize, a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright U.S. Scholar Award.


Title: Fairness in visual recognition

Speaker: Olga Russakovsky

Date: Tuesday, August 18th. 12pm ET


Abstract: Computer vision models trained on unparalleled amounts of data hold promise for making impartial, well-informed decisions in a variety of applications. However, more and more historical societal biases are making their way into these seemingly innocuous systems. Visual recognition models have exhibited bias by inappropriately correlating age, gender, sexual orientation and race with a prediction. The downstream effects of such bias range from perpetuating harmful stereotypes on an unparalleled scale to increasing the likelihood of being unfairly predicted as a suspect in a crime. In this talk, we'll dive deeper both into the technical reasons and the potential solutions for algorithmic fairness in computer vision. We will discuss our recent proposed solutions (CVPR 2020 https://arxiv.org/abs/1911.11834, FAT*2020 http://image-net.org/filtering-and-balancing/, ECCV 2020 https://github.com/princetonvisualai/revise-tool) as well as ongoing efforts. If you're interested in collaborating on these topics I'd love to meet after the talk.


Bio: Dr. Olga Russakovsky is an Assistant Professor in the Computer Science Department at Princeton University. Her research is in computer vision, closely integrated with the fields of machine learning, human-computer interaction and fairness, accountability and transparency. She has been awarded the AnitaB.org's Emerging Leader Abie Award in honor of Denice Denton in 2020, the CRA-WP Anita Borg Early Career Award in 2020, the MIT Technology Review's 35-under-35 Innovator award in 2017, the PAMI Everingham Prize in 2016 and Foreign Policy Magazine's 100 Leading Global Thinkers award in 2015. In addition to her research, she co-founded and continues to serve on the Board of Directors of the AI4ALL foundation dedicated to increasing diversity and inclusion in Artificial Intelligence (AI). She completed her PhD at Stanford University in 2015 and her postdoctoral fellowship at Carnegie Mellon University in 2017.





Title: Point-based object detection

Speaker: Philipp Krähenbühl 

Date: Tuesday, August 11th. 12pm ET


Abstract:

Objects are commonly thought of as axis-aligned boxes in an image. Even before deep learning, the best performing object detectors classified rectangular image regions. On one hand, this approach conveniently reduces detection to image classification. On the other hand, it has to deal with a nearly exhaustive list of image regions that do not contain any objects. In this talk, I'll present an alternative representation of objects: as points. I'll show how to build an object detector from a keypoint detector of object centers. The presented approach is both simpler and more efficient (faster and/or more accurate) than equivalent box-based detection systems. Our point-based detector easily extends to other tasks, such as object tracking, monocular or Lidar 3D detection, and pose estimation.
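As a rough illustration of building detections from a keypoint detector of object centers, the sketch below decodes peaks of a center heatmap into boxes. It is written in the spirit of point-based detection rather than as the speaker's actual implementation, and the tensor layout and size-regression head are assumptions.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=100):
    """Turn a center heatmap into detections: local maxima of the heatmap are
    treated as object centers, and a regression head `wh` gives box width and
    height at each location. Shapes (assumed): heatmap (B, C, H, W), wh (B, 2, H, W)."""
    # Keep only local maxima (peaks) via max pooling.
    peaks = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    heatmap = heatmap * (peaks == heatmap).float()

    b, c, h, w = heatmap.shape
    scores, idx = heatmap.view(b, -1).topk(k)                 # top-k peaks overall
    classes = torch.div(idx, h * w, rounding_mode='floor')
    pixel = idx % (h * w)
    ys = torch.div(pixel, w, rounding_mode='floor')
    xs = pixel % w

    boxes = []
    for i in range(b):
        bw = wh[i, 0, ys[i], xs[i]]
        bh = wh[i, 1, ys[i], xs[i]]
        boxes.append(torch.stack(
            [xs[i] - bw / 2, ys[i] - bh / 2,
             xs[i] + bw / 2, ys[i] + bh / 2], dim=-1))        # (k, 4) per image
    return scores, classes, torch.stack(boxes)
```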


Most detectors, including ours, are usually trained on a single dataset and then evaluated in that same domain. However, it is unlikely that any user of an object detection system only cares about 80 COCO classes or 23 nuScenes vehicle categories in isolation. More likely than not, the object classes needed in a downstream system are either spread over different data sources or not annotated at all. In the second part of this talk, I'll present a framework for learning object detectors on multiple different datasets simultaneously. We automatically learn the relationships between object annotations in different datasets and merge them into a common taxonomy. The resulting detector then reasons about the union of object classes from all datasets at once. This detector is also easily extended to unseen classes by fine-tuning it on a small dataset with novel annotations.


Bio:

Philipp is an Assistant Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 2014 from the CS Department at Stanford University and then spent two wonderful years as a PostDoc at UC Berkeley. His research interests lie in Computer Vision, Machine learning, and Computer Graphics. He is particularly interested in deep learning, image understanding, and vision and action.





Title: Self-supervised Reconstruction and Interaction

Speaker: Shubham Tulsiani

Date: August 4th, 2020. 12pm ET

Abstract: We live in a physical world, and any artificially intelligent agent must understand and act in it. In this talk, I will present some recent efforts towards building systems that can infer the spatial and physical structure underlying their visual percepts and can further leverage such understanding for acting efficiently. Across these projects, I will highlight how incorporating our knowledge about the laws of geometry and physics can help bypass the need for tedious manual supervision for learning, thereby allowing us to learn 3D reconstruction and interaction in a self-supervised manner.


Bio: Shubham Tulsiani is a research scientist at Facebook AI Research (FAIR) and will be joining the CMU School of Computer Science as an Assistant Professor in Fall 2021. He received a PhD in Computer Science from UC Berkeley under the supervision of Jitendra Malik in 2018. He is interested in building perception systems that can infer the spatial and physical structure of the world they observe. His work was awarded the 'Best Student Paper Award' at CVPR 2015.




Title: Some Vision + Language, more AI + Creativity

Speaker: Devi Parikh

Date: Tuesday, July 28. 12pm ET


Abstract: I will give an informal talk describing our recent work in vision + language, and early explorations in AI + Creativity. I hope for this to be a casual and interactive setting, where I'll take questions, comments, and feedback along the way. If time permits, and there is interest, I'd also be happy to chat about time management or anything else I've written about in my blog posts :) https://medium.com/@deviparikh (or anything else, really).

Some more specifics: I will talk briefly about our work on training a transformer-based model for multiple (12) vision and language tasks. I will spend some time showing you a demo of its capabilities. I will then talk about some of our initial work in seeing how AI can inspire human creativity in the context of thematic typography and dance movements, studying various human-human collaboration settings for creating digital sketches on a web interface, generating a visual abstraction that summarizes how your day was, neuro-symbolic generative art, etc. 


Bio: Devi Parikh is an Associate Professor in the School of Interactive Computing at Georgia Tech, and a Research Scientist at Facebook AI Research (FAIR).

From 2013 to 2016, she was an Assistant Professor in the Bradley Department of Electrical and Computer Engineering at Virginia Tech. From 2009 to 2012, she was a Research Assistant Professor at Toyota Technological Institute at Chicago (TTIC), an academic computer science institute affiliated with University of Chicago. She has held visiting positions at Cornell University, University of Texas at Austin, Microsoft Research, MIT, Carnegie Mellon University, and Facebook AI Research. She received her M.S. and Ph.D. degrees from the Electrical and Computer Engineering department at Carnegie Mellon University in 2007 and 2009 respectively. She received her B.S. in Electrical and Computer Engineering from Rowan University in 2005.


Her research interests are in computer vision, natural language processing, embodied AI, human-AI collaboration, and AI for creativity.


She is a recipient of an NSF CAREER award, an IJCAI Computers and Thought award, a Sloan Research Fellowship, an Office of Naval Research (ONR) Young Investigator Program (YIP) award, an Army Research Office (ARO) Young Investigator Program (YIP) award, a Sigma Xi Young Faculty Award at Georgia Tech, an Allen Distinguished Investigator Award in Artificial Intelligence from the Paul G. Allen Family Foundation, four Google Faculty Research Awards, an Amazon Academic Research Award, a Lockheed Martin Inspirational Young Faculty Award at Georgia Tech, an Outstanding New Assistant Professor award from the College of Engineering at Virginia Tech, a Rowan University Medal of Excellence for Alumni Achievement, Rowan University’s 40 under 40 recognition, a Forbes’ list of 20 “Incredible Women Advancing A.I. Research” recognition, and a Marr Best Paper Prize awarded at the International Conference on Computer Vision (ICCV).


https://www.cc.gatech.edu/~parikh



Title: Our recent research on 3D Deep Learning

Speaker: Vittorio Ferrari

Date: Tuesday, July 14th - 12pm ET


Abstract: 

I will present three recent projects within the 3D Deep Learning research line from my team at Google Research:

(1) a deep network for reconstructing the 3D shape of multiple objects appearing in a single RGB image (ECCV'20);

(2) a new conditioning scheme for normalizing flow models. It enables several applications such as reconstructing an object's 3D point cloud from an image, or the converse problem of rendering an image given a 3D point cloud, both within the same modeling framework (CVPR'20);

(3) a neural rendering framework that maps a voxelized object into a high-quality image. It renders highly textured objects and illumination effects such as reflections and shadows realistically, and it allows controllable rendering: geometric and appearance modifications in the input are accurately represented in the final rendering (CVPR'20).


Bio: Vittorio Ferrari is a Senior Staff Research Scientist at Google, where he leads a research group on visual learning. He received his PhD from ETH Zurich in 2004, then was a post-doc at INRIA Grenoble (2006-2007) and at the University of Oxford (2007-2008). Between 2008 and 2012 he was an Assistant Professor at ETH Zurich, funded by a Swiss National Science Foundation Professorship grant. In 2012-2018 he was on the faculty at the University of Edinburgh, where he became a Full Professor in 2016 (he is now an Honorary Professor). In 2012 he received the prestigious ERC Starting Grant and the best paper award from the European Conference on Computer Vision. He is the author of over 120 technical publications. He regularly serves as an Area Chair for the major computer vision conferences; he was a Program Chair for ECCV 2018 and is a General Chair for ECCV 2020. He is an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. His current research interests are in learning visual models with minimal human supervision, human-machine collaboration, and 3D Deep Learning.





Title: Contrastive Learning: A General Self-supervised Learning Approach

Speaker: Yonglong Tian

Date: Tuesday, June 30th, 12pm ET

Abstract:

Self-supervised learning aims at learning effective visual representations without human annotations, and is a long-standing problem. Recently, contrastive learning between multiple views of the data has significantly improved the state of the art in self-supervised learning. Despite its success, the influence of different view choices has been less studied.

In this talk, I will first summarize recent progress on contrastive representation learning from a unified multi-view perspective. I will then propose an InfoMin principle: we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we also devise unsupervised and semi-supervised frameworks to learn effective views. Lastly, I will extend the application of contrastive learning beyond self-supervised learning.
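As a concrete reference point for the contrastive objectives discussed above, here is a minimal InfoNCE-style loss between two views of a batch. It is a generic sketch (no memory bank, momentum encoder, or the InfoMin view-learning machinery), not the speaker's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Contrastive loss between two batches of view embeddings, where
    (z1[i], z2[i]) are assumed to be two views of the same image; all other
    pairs act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # positives lie on the diagonal
```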


Bio:

Yonglong Tian is currently a PhD student at MIT, working with Prof. Phillip Isola and Prof. Joshua Tenenbaum. Yonglong’s general research interests lie in the intersection of machine perception, learning and reasoning, mainly from the perspective of vision. These days he focuses more on unsupervised representation learning and visual program induction.




Title: Explorable Super Resolution

Speaker: Yuval Bahat

Time: Tuesday, June 23rd. 12pm EST


Abstract:

Single image super resolution (SR) has seen major performance leaps in recent years. However, existing methods do not allow exploring the infinitely many plausible reconstructions that might have given rise to the observed low-resolution (LR) image. These different explanations of the LR image may dramatically vary in their textures and fine details, and may often encode completely different semantic information. In this work, we introduce the task of explorable super resolution. We propose a framework comprising a graphical user interface with a neural network backend, allowing the user to edit the SR output so as to explore the abundance of plausible HR explanations of the LR input. At the heart of our method is a novel module that can wrap any existing SR network, analytically guaranteeing that its SR outputs precisely match the LR input when downsampled. Besides its importance in our setting, this module is guaranteed to decrease the reconstruction error of any SR network it wraps, and can be used to cope with blur kernels that are different from the one the network was trained for. We illustrate our approach in a variety of use cases, ranging from medical imaging and forensics to graphics.
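To make the consistency idea concrete, the sketch below shows one simple back-projection-style correction that pushes an SR output toward agreement with the LR input. It assumes bicubic resampling and only approximately enforces the constraint in one step; it is not the authors' analytical wrapping module, which handles general blur kernels and gives exact guarantees.

```python
import torch.nn.functional as F

def enforce_consistency(sr, lr, scale):
    """Correct an SR output so that downsampling it better reproduces the LR
    input. Back-projection-style sketch: one step reduces (but does not in
    general eliminate) the inconsistency unless the up/downsamplers are exact
    inverses. sr: (B, C, s*H, s*W), lr: (B, C, H, W), scale: integer factor."""
    downsampled = F.interpolate(sr, size=lr.shape[-2:], mode='bicubic',
                                align_corners=False)
    residual = lr - downsampled                 # what the SR output misses
    correction = F.interpolate(residual, scale_factor=scale,
                               mode='bicubic', align_corners=False)
    return sr + correction
```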


Bio:

Yuval is a postdoctoral researcher working with Prof. Tomer Michaeli at the Technion. His research focuses on the intersection of computer vision and audio processing with machine learning. He completed his PhD at the Weizmann Institute of Science, where his advisor was Prof. Michal Irani. Previously, he completed his M.Sc. at the Technion, where he was advised by Prof. Yoav Y. Schechner.



Title: Why Neural Rendering is Super Cool!

Speaker: Matthias Nießner

Date: Tuesday, May 19th 12pm EST

Link: https://mit.zoom.us/j/97312320018


Abstract

In this talk, I will present my research vision of how to create photo-realistic digital replicas of the real world, and how to make holograms become a reality. Eventually, I would like to see photos and videos evolve to become interactive, holographic content indistinguishable from the real world. Imagine taking such 3D photos to share with friends, family, or social media; the ability to fully record historical moments for future generations; or to provide content for upcoming augmented and virtual reality applications. AI-based approaches, such as generative neural networks, are becoming more and more popular in this context since they have the potential to transform existing image synthesis pipelines. I will specifically talk about an avenue towards neural rendering where we can retain the full control of a traditional graphics pipeline while at the same time exploiting modern capabilities of deep learning, such as handling the imperfections of content from commodity 3D scans.

While the capture and photo-realistic synthesis of imagery opens up unbelievable possibilities for applications ranging from the entertainment to communication industries, there are also important ethical considerations that must be kept in mind. Specifically, in the context of fabricated news (e.g., fake news), it is critical to highlight and understand digitally-manipulated content. I believe that media forensics plays an important role in this area, both from an academic standpoint to better understand image and video manipulation, and even more importantly from a societal standpoint to raise awareness of what is possible and to highlight potential avenues and solutions for establishing trust in digital content.


Bio:

Dr. Matthias Nießner is a Professor at the Technical University of Munich where he leads the Visual Computing Lab. Before that, he was a Visiting Assistant Professor at Stanford University. Prof. Nießner's research lies at the intersection of computer vision, graphics, and machine learning, where he is particularly interested in cutting-edge techniques for 3D reconstruction, semantic 3D scene understanding, video editing, and AI-driven video synthesis. In total, he has published over 70 academic publications, including 22 papers at the prestigious ACM Transactions on Graphics (SIGGRAPH / SIGGRAPH Asia) journal and 18 works at the leading vision conferences (CVPR, ECCV, ICCV); several of these works won best paper awards, including at SIGCHI'14, HPG'15, SPG'18, and the SIGGRAPH'16 Emerging Technologies Award for the best Live Demo. Prof. Nießner's work enjoys wide media coverage, with many articles featured in mainstream media including the New York Times, Wall Street Journal, Spiegel, MIT Technology Review, and many more, and his work led to several TV appearances such as on Jimmy Kimmel Live, where Prof. Nießner demonstrated the popular Face2Face technique. Prof. Nießner's academic YouTube channel currently has over 5 million views. For his work, Prof. Nießner received several awards: he is a TUM-IAS Rudolph Moessbauer Fellow (2017 – ongoing), he won the Google Faculty Award for Machine Perception (2017), the Nvidia Professor Partnership Award (2018), as well as the prestigious ERC Starting Grant 2018, which comes with 1.5 million Euros in research funding; in 2019, he received the Eurographics Young Researcher Award honoring the best upcoming graphics researcher in Europe. In addition to his academic impact, Prof. Nießner is a co-founder and director of Synthesia Inc., a startup backed by Mark Cuban, whose aim is to empower storytellers with cutting-edge AI-driven video synthesis.



Title: A fine(r)-grained perspective onto object interactions 

Speaker: Dima Damen

Date: Tuesday, May 12th 2020


Abstract: 

This talk aims to argue for a fine(r)-grained perspective onto human-object interactions, from video sequences. I will present approaches for determining skill or expertise from video sequences [CVPR 2019], assessing action ‘completion’ – i.e. when an interaction is attempted but not completed [BMVC 2018], dual-domain and dual-time learning [CVPR 2020, CVPR 2019, ICCVW 2019] as well as multi-modal approaches using vision, audio and language [CVPR 2020, ICCV 2019, BMVC 2019].

I will also introduce EPIC-KITCHENS [ECCV 2018], and its upcoming extension [Under Review], the largest egocentric dataset recorded in people's homes. The dataset now includes 20M frames of 90K action segments and 100 hours of recording, fully annotated with objects and actions, based on unique annotations from the participants narrating their own videos, thus reflecting true intention. [http://epic-kitchens.github.io]


Bio:

Dima Damen is a Reader (Associate Professor) in Computer Vision at the University of Bristol, United Kingdom. She received her PhD from the University of Leeds, UK (2009). Dima is currently an EPSRC Fellow (2020-2025), focusing on her research interests in the automatic understanding of object interactions, actions and activities using static and wearable visual (and depth) sensors. Dima is a program chair for ICCV 2021 in Montreal, and an associate editor of IJCV (2020-), IEEE TPAMI (2019-) and Pattern Recognition (2017-). She was selected as a Nokia Research collaborator in 2016, and as an Outstanding Reviewer at ICCV17, CVPR13 and CVPR12. She currently supervises 5 PhD students and 4 postdoctoral researchers. More details at: [http://dimadamen.github.io]





Title: Overcoming Data Scarcity in Deep Learning

Speaker: David Acuna

Date: Tuesday, April 28th. 12pm

Zoom Link: https://mit.zoom.us/j/96069673785


Abstract:

Training deep neural networks for computer vision tasks is a time-consuming and expensive process that typically involves collecting and manually annotating a large amount of data for supervised learning. For example, labeling pixelwise masks for tasks such as instance and semantic segmentation typically requires a human annotator to spend 20-30 seconds per object per image. This is even more problematic when the task demands either high precision (e.g., object boundaries) or expert knowledge (e.g., the medical domain).

In this talk, I will present our initial efforts towards overcoming the need for massive manually-labeled datasets. Firstly, I will introduce our work on Cross-Domain Interactive Annotation, which constitutes one of the first attempts to use semi-automatic annotation methods, trained on abundant driving images, to cheaply collect labels in unseen domains. Then, I will introduce STEAL, a framework that allows training SoTA boundary detectors from cheap noisy annotations and can be further used to improve coarsely annotated segmentation datasets. Later, as an alternative to labeling, I will show how we can generate and incorporate cheap synthetic images into the training pipeline by using a technique called Domain Randomization and careful finetuning. Finally, I will present the Neural Data Server, a search engine that indexes several popular datasets and finds the most useful transfer-learning data for the target domain and task.

 


Bio: 

David is a second-year PhD student in the Machine Learning Group at the University of Toronto supervised by Prof. Sanja Fidler. He is also affiliated with the Vector Institute for Artificial Intelligence and a Research Scientist at Nvidia. His research is focused on learning efficient representations that transfer across multiple domains and help to overcome the need for massive manually-labeled datasets. To this end, he is also interested in developing new learning algorithms and neural architectures that lead to better generalization. He has co-authored more than 10 publications in top-tier conferences including Polygon-RNN++, STEAL, GSCNN, the Neural Data Server, and one of the first attempts to bridge the gap between real and synthetic images for computer vision tasks via domain randomization.





Title: Challenges in Video Understanding: Supervision, Biases and Temporal Reasoning

Speaker: Rohit Girdhar

Time: Tuesday, April 14th. 12pm EST


Abstract: Video understanding has evolved significantly in the last few years. The datasets have gotten larger, tasks more complex, and models deeper and spatio-temporal. In this talk, I will take a closer look at these developments and argue that, in spite of the improvements, we are still a long way from human-level video understanding capability. Our datasets and tasks are still beset with biases that stifle the development of truly spatio-temporal visual reasoning architectures. Moreover, our models are supervision-hungry, and there is a need to build systems that can learn from few samples, similar to how humans do. Finally, I will discuss my recent work that proposes some initial solutions to remedy this situation, through data-efficient approaches and by building benchmarks that by design require temporal reasoning.

Bio: Rohit is a research scientist at Facebook AI Research, working on computer vision and machine learning. His current research focuses on modeling temporal dynamics, with applications including video understanding and physical reasoning. He obtained a PhD from Carnegie Mellon University advised by Deva Ramanan. His research has won multiple international challenges, as well as a Best Paper Award at ICCV’19 Workshop, Siebel Scholarship at CMU, and a Gold Medal and Research Award for undergraduate research at IIIT Hyderabad. He has also spent time at DeepMind, Facebook and Adobe through research internships.




Title: On Efficiently Modeling Videos

Speaker: Chao-Yuan Wu

Time: Tuesday, April 7th. 12pm.

Abstract:

Today's computer vision systems accurately parse scenes, recognize objects, reason about their shape, composition, etc., all on static images. However, our world is not static; it changes and evolves over time. To truly understand the visual world, one must go beyond snapshots, and reason about it in its full spatial-temporal extent. Videos carry this information. Unfortunately, today's video models are lagging behind in at least three ways: they are slow, inaccurate, and wasteful.


We show that a central cause of all these issues is the inefficiency in current systems. They are inefficient to store, inefficient to model, inefficient to train, and exhibit poor data efficiency. In our recent work, we redesign the whole video pipeline: from recognition, compression/decompression to training and model design. We will first present video recognition from compression, which exploits compression to remove superfluous information by up to two orders of magnitude for efficient recognition. We will then introduce video compression through recognition, which leverages recent advances in recognition systems to improve compression of visual patterns. To support understanding of long-form videos (e.g. movies), we will also present a scalable solution through an augmented feature-bank design. Finally, we introduce our latest work on an efficient training method to reduce the high economic and environmental costs of video model training.



Bio:

Chao-Yuan Wu is a PhD student at UT Austin, working with Philipp Krähenbühl. Chao-Yuan's research interests are mainly in computer vision. His current research focuses on video understanding. Chao-Yuan holds an MS degree in ML from CMU and a BS degree in EE from NTU. He has also worked with Ross Girshick, Kaiming He, and Christoph Feichtenhofer at FAIR and Alex Smola at Amazon during his PhD.





Title: Visual Learning with fewer labels

Speaker: Bharath Hariharan

Time: Tuesday March 31st, 12pm EST


Abstract: Humans can learn new visual concepts from a few examples, but machines require hundreds or thousands of training images even with state-of-the-art recognition machinery. The key challenge here is the large intra-class variation in many visual classes, which cannot be captured by a few images alone. To address this challenge, the recognition system must leverage its past visual experience with similar classes that share modes of intra-class variation.

 

While this intuition seems straightforward, it is unclear exactly what the system must transfer from previously seen classes to a novel class with a few examples. In this talk I will describe work from my group on answering this question. I will show results using machine learning to effectively learn to hallucinate additional examples for rare classes by transferring modes of variation learnt on common classes. I will then discuss an alternative route, by explicitly incorporating domain knowledge of objects and parts. I will show how this latter strategy can provide large gains in “few-shot” accuracy with few additional parameters. These results suggest that unlike large-scale recognition, black-box models that ignore domain knowledge may not be the way forward in the low-supervision regime.

 

Bio: I am an assistant professor in computer science at Cornell. Before joining Cornell, I was a postdoctoral scholar at Facebook AI Research. I finished my PhD with Jitendra Malik at University of California, Berkeley. My interests are in computer vision in general and visual recognition in particular. My current focus is on reducing the large data requirements of modern vision techniques.






Title: Identifying Bias in Dataset Replication

Speaker: Andrew Ilyas

Time: Tuesday, March 24th. 12pm-1pm  


Abstract: Dataset replication can be a powerful tool for assessing the extent to which state-of-the-art models have "overfit" to a finite-sample test set, or to meaningless artifacts of the data distribution. In this talk, we identify statistical bias as a somewhat unintuitive source of error in dataset replication. Using the ImageNet-v2 dataset as a case study, we discuss several techniques for mitigating the effects of this bias.


Bio: Andrew Ilyas is a PhD student at MIT, advised by Aleksander Madry and Costis Daskalakis. His interests lie in quantifying and improving the robustness of machine learning systems.




Title: Neuro-Symbolic Frameworks for Visual Concept Learning and Language Acquisition

Speaker: Jiayuan Mao

Time: Thursday, March 19th. 2pm

Abstract: Humans are capable of learning visual concepts by jointly understanding vision and language. Imagine that someone with no prior knowledge of colors is presented with images of red and green objects, paired with descriptions. They can easily identify the difference in the objects' visual appearance (in this case, color) and align it with the corresponding words. This intuition motivates the use of image-text pairs to facilitate automated visual concept learning and language acquisition.


In the talk, I will present recent progress on neuro-symbolic models for visual concept learning, reasoning, and language acquisition. These models learn visual concepts and their association with symbolic representations of language and unravel syntactic structures, as well as compositional semantics of sentences, only by looking at images and reading paired natural language texts. No explicit supervision, such as class labels for objects or parse trees, is needed. I will also discuss their extensions to syntactic bootstrapping, metaconcept reasoning, action grounding, and robotic planning.


Bio: Jiayuan Mao is a Ph.D. student at MIT, advised by Professors Josh Tenenbaum and Leslie Kaelbling. Mao's research focuses on structured knowledge representations that can be transferred among tasks and inductive biases that improve the learning efficiency and generalization. Representative research topics are concept learning, neuro-symbolic reasoning, scene understanding, language acquisition, and robotic planning.




Title: Neurosymbolic 3D Models: Hybrid Neural-Procedural 3D Shape Synthesis

Speaker: Daniel Ritchie

Place: 32-D507

Date: Tuesday, March 10th, 12pm


Abstract:

Generative models of 3D shapes promise compelling possibilities: the elimination of tedious manual 3D modeling in creative practices, powerful priors for autonomous vision and 3D reconstruction, and more. Procedural representations (such as probabilistic grammars or programs) are one such possibility: they offer high-quality and editable outputs but are difficult to author and often result in limited diversity among the output shapes. On the other extreme are deep generative models: given enough data, they can learn to generate any class of shapes, but their outputs exhibit artifacts and their representation is not easily interpretable or editable. In this talk, I’ll discuss my ongoing research agenda toward achieving the best of both worlds: neurosymbolic 3D models, i.e. a hybrid neural-procedural approach to 3D shape synthesis. The talk will focus largely on recent work toward designing a low-level “assembly language” for 3D shapes, as well as deep generative models which can learn to author programs in this language.


Bio:

Daniel Ritchie is an Assistant Professor of Computer Science at Brown University, where he co-leads the Visual Computing group. His research sits at the intersection of computer graphics and machine learning: broadly speaking, he is interested in helping machines to understand the visual world, so that they can in turn help people to be more visually expressive. His group's current work focuses on data-driven methods for analyzing and synthesizing 3D scenes and the 3D objects that comprise them. He received his PhD from Stanford University and his undergraduate degree from UC Berkeley, both in Computer Science.




Title: Real world bottlenecks in building machine learning systems

Speaker: Nataniel Gutierrez

Place: 32-D507

Date: Tuesday, February 25th, 12pm. 


Abstract: In this talk, he will explore two areas of research:


Learning To Simulate

Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. We propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications. We also discuss potential implications of this research.
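The reinforcement-learning idea above can be sketched as a score-function (REINFORCE-style) update over a distribution of simulator parameters. In the sketch below, `simulate` and `train_and_validate` are hypothetical placeholders for the non-differentiable simulator and the downstream training loop, and the Gaussian policy (without a variance-reducing baseline) is an illustrative assumption rather than the paper's exact formulation.

```python
import torch

def learn_to_simulate(simulate, train_and_validate, dim, steps=100):
    """Tune non-differentiable simulator parameters so that a model trained on
    the synthesized data maximizes validation accuracy on real data.
    `simulate(params)` -> synthetic dataset; `train_and_validate(data)` -> float
    reward. Both are assumed helpers, not part of any published API."""
    mean = torch.zeros(dim, requires_grad=True)
    log_std = torch.zeros(dim, requires_grad=True)
    opt = torch.optim.Adam([mean, log_std], lr=0.05)
    for _ in range(steps):
        dist = torch.distributions.Normal(mean, log_std.exp())
        params = dist.sample()                          # candidate simulator params
        reward = train_and_validate(simulate(params))   # accuracy on real data
        loss = -dist.log_prob(params).sum() * reward    # policy-gradient surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mean.detach()
```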


Behavior Understanding using Facial Analysis

The advent of deep learning has allowed the community to achieve great performance in facial analysis tasks such as keypoint estimation, head pose estimation and action unit detection. We believe human behavior understanding is an important and immediate frontier in computer vision research. We describe recent developments in our research on human behavior understanding and behavior prediction, which build upon the recent successes in facial analysis. In particular, we describe work on estimating attention of human beings by jointly modeling gaze and scene saliency. We also describe recent work in behavior prediction of students working with an intelligent tutoring system by leveraging affect transfer learning.









Title: A primer on normalizing flows

Speaker: Laurent Dinh

Place: 32-D507

Date: February 20th, 12pm. 


Abstract: Normalizing flows are a flexible family of probability distributions that can serve as generative models for a variety of data modalities. Because flows can be expressed as compositions of expressive functions, they have successfully harnessed recent advances in deep learning. An ongoing challenge in developing these methods is the definition of expressive yet tractable building blocks. In this talk, I will introduce the fundamentals and describe recent work (including my own) on this topic.
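For reference, the change-of-variables identity that makes such compositions tractable (standard material rather than anything specific to this talk):

```latex
% If x = f(z) with f invertible and z ~ p_Z, then
\log p_X(x) = \log p_Z\big(f^{-1}(x)\big)
            + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|
% A flow composes f = f_K \circ \cdots \circ f_1, so the log-determinant
% terms of the individual layers simply add up, which is why tractable
% building blocks are the central design question.
```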

Bio: I am a research scientist at Google Brain (Montréal, Canada). The focus of my work is currently deep generative models, probabilistic modeling, generalization, and optimization.

I obtained my PhD in deep learning at Mila (Montréal, Canada), under the supervision of Yoshua Bengio. Prior to that I was studying at ECP (Paris, France) in applied mathematics and at ENS Cachan in machine learning and computer vision. I had the privilege to work as an intern in the machine learning group led by Nando de Freitas both at UBC (Vancouver, Canada) and DeepMind (London, United Kingdom), and also at Google Brain (Mountain View, US), under the supervision of Samy Bengio.






Title: Real world bottlenecks in building machine learning systems

Speaker: Swami Sankaranarayanan

Place: 32-D507

Date: February 11th, 12pm. 


Abstract: A fundamental problem in representation learning is to learn features of interest that are robust and invariant to several nuisance factors. Over the recent past, for a variety of tasks involving images, learned representations using deep networks have empirically been shown to outperform handcrafted representations. However, their inability to generalize across data distributions poses the following question: do representations learned using deep networks just fit a given distribution, or do they sufficiently model the underlying structure of the problem? In my research I have focused on topics related to the above aspects, including adversarial learning, dataset shift, and distribution shift. In this talk, I will present some ideas from our recent research involving training and evaluating machine learning models on noisily labeled data. We show that explicitly modeling annotation confusion not only achieves robustness to a high degree of label noise but is also extremely useful in ranking annotators. Along with benchmark results, I will also highlight some results of our approach on real-world medical image classification tasks. This will be followed by a brief discussion of our most recent work on evaluating learned models even in cases where human experts disagree about the underlying truth. I will conclude with a highlight of my work at a cutting-edge medical imaging startup, developing and deploying real-world safety-critical deep learning models, along with my directions for future research.
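One common way to "explicitly model annotation confusion" is sketched below, under the assumption of a single learned, row-stochastic confusion matrix applied to the classifier's clean-label predictions; the speaker's formulation may differ (e.g., per-annotator matrices), and all names here are illustrative.

```python
import torch
import torch.nn as nn

class ConfusionAwareHead(nn.Module):
    """The classifier predicts the clean label distribution; a learned
    confusion matrix maps it to the distribution over noisy annotations."""
    def __init__(self, backbone, num_classes):
        super().__init__()
        self.backbone = backbone                        # any image -> logits model
        self.confusion_logits = nn.Parameter(
            torch.eye(num_classes) * 4.0)               # initialized near identity

    def forward(self, x):
        clean_probs = self.backbone(x).softmax(dim=-1)        # (B, K)
        confusion = self.confusion_logits.softmax(dim=-1)     # (K, K), rows sum to 1
        noisy_probs = clean_probs @ confusion                 # (B, K)
        return clean_probs, noisy_probs

# Training would minimize the negative log-likelihood of the observed (noisy)
# labels under `noisy_probs`, while `clean_probs` is used at test time and the
# learned confusion matrix can be inspected to rank annotators.
```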


Bio: I am a Machine Learning Researcher at a healthcare startup, Butterfly Network, where I have been involved in developing and deploying real machine learning systems related to healthcare. Previously, I obtained my PhD degree at the University of Maryland, College Park, supervised by Prof. Rama Chellappa, where I received a department-wide award for best dissertation (top 5%). In my PhD research I worked on several topics in machine learning such as metric learning, adversarial training, and domain adaptation, and applied these techniques to design robust computer vision systems. Before my PhD, I spent two lovely years in Europe, obtaining my Masters from TU Delft, Netherlands and INRIA Sophia Antipolis, France. My master's thesis focused on evaluating people-tracking systems. I am originally from India, where I obtained my undergraduate degree in Electronics Engineering with a focus on Wireless Communication Systems. My hobbies include hiking/cycling, photography, and carpentry.





Title: Data-driven Computational Studio

Speaker: Aayush Bansal

Place: 32-D507

Date: February 4th, 12pm. 



Abstract: Humans have a remarkable ability to associate different concepts and create visual worlds far beyond what can be seen by a human eye. These capabilities include (and are not limited to) inferring the state of the unobserved, imagining the unknown, and thinking of diverse possibilities about what lies in the future. Humans, however, struggle to share their rich mental imagery with others precisely. A computational machinery that can understand and create a four-dimensional audio-visual world can enable humans to share their vision with others better than a description via words. My research aims at building the Computational Studio that allows average humans to express themselves using everyday computational devices audio-visually.


My research on the Computational Studio enables the creation of content that spans from images to audio-video to 3D models to 4D space-time visualizations of dynamic events. In this talk, I will stress three essential aspects of audio-visual content creation. In the first part, I will demonstrate the importance of spatial-temporal information using examples from 4D space-time visualization of dynamic events and video retargeting. I will then highlight the importance of thinking about an average human user and the necessity of thinking about computational devices. Finally, I will demonstrate the importance of multi-modal information via audio-visual synthesis.


Bio: Aayush Bansal is a Ph.D. candidate at the Robotics Institute of Carnegie Mellon University. At CMU, he collaborates with Prof Deva Ramanan, Prof Yaser Sheikh, and Prof Srinivas Narasimhan. He is a Presidential Fellow at CMU, and a recipient of Uber Presidential Fellowship (2016-17), Qualcomm Fellowship (2017-18), and Snap Fellowship (2019-20). Various national and international media such as NBC, CBS, France TV, and The Journalist have extensively covered his work. More details are here: http://www.cs.cmu.edu/~aayushb




Title: Learning to Learn More with Less

Speaker: Yuxiong Wang


Abstract:

Understanding how humans and machines learn from few examples remains a fundamental challenge. Humans are remarkably able to grasp a new concept from just a few examples, or learn a new skill from just a few trials. By contrast, state-of-the-art machine learning techniques typically require thousands of training examples and often break down if the training sample set is too small.


In this talk, I will discuss our efforts towards endowing visual learning systems with few-shot learning ability. Our key insight is that the visual world is well structured and highly predictable not only in feature spaces but also in under-explored model and data spaces. Such structures and regularities enable the systems to learn how to learn new tasks rapidly by reusing previous experiences. I will focus on a few topics to demonstrate how to leverage this idea of learning to learn, or meta-learning, to address a broad range of few-shot learning tasks: meta-learning in model space and task-oriented generative modeling. I will also discuss some ongoing work towards building machines that are able to operate in highly dynamic and open environments, making intelligent and independent decisions based on insufficient information.


Bio: Yuxiong Wang is a postdoctoral fellow in the Robotics Institute at Carnegie Mellon University. He received a Ph.D. in robotics in 2018 from Carnegie Mellon University. His research interests lie in the intersection of computer vision, machine learning, and robotics, with a particular focus on few-shot learning and meta-learning. He has spent time at Facebook AI Research (FAIR).




Title: Embodied Visual Recognition

Speaker: Hsiao-Yu (Fish) Tung 

Time: Tuesday Nov 5. 12pm

Location: 32-D507


Abstract: 

Current state-of-the-art CNNs can localize and name objects in internet photos, yet they miss the basic knowledge that a two-year-old toddler already possesses: objects persist over time despite changes in the camera view, they have 3D extent, they do not 3D intersect, and so on. In this talk, I will introduce neural architectures that learn to parse video streams of a static scene into world-centric 3D feature maps by disentangling camera motion from scene appearance. I will show that the proposed architectures learn object permanence, can generate RGB views from novel viewpoints in truly novel scenes, can infer affordances described in sentences by grounding language in 3D visual simulations, and can learn intuitive physics in a persistent 3D feature space. Our experiments suggest that the proposed architecture is essential for generalizing across objects and locations, and that it overcomes many limitations of 2D CNNs.


Bio: 

Hsiao-Yu (Fish) Tung is a fifth-year PhD student in the Machine Learning Department at CMU, advised by Professor Katerina Fragkiadaki. She is interested in building machines that can understand and interact with the world. Her research spans unsupervised learning, computer vision, graphics, robotics, and language. She was selected for the 2019 Rising Stars in EECS program. Her research is supported by the Yahoo InMind fellowship.

She received her M.S. from CMU MLD and her B.S. in Electrical Engineering from National Taiwan University. During her master's degree, she worked with Professor Alex Smola on spectral methods for Bayesian models and designed efficient and provable algorithms for unsupervised topic discovery.






Title: Deep Neural Networks for 3D Processing and High-Dimensional Filtering

Speaker: Hang Su

Time: Tuesday Oct 22. 12pm

Location: 32-D507


Abstract: 

Deep neural networks (DNNs) have seen tremendous success in the past few years, advancing the state of the art in many AI areas by significant margins. Part of this success can be attributed to the wide adoption of convolutional filters. These filters can effectively capture the invariance in data, leading to easier training and more compact representations, and at the same time can leverage extremely efficient implementations on modern hardware. Since convolution operates on regularly structured grids, it is a particularly good fit for texts and images where there are inherent 1D or 2D structures. However, extending DNNs to 3D or higher-dimensional spaces is non-trivial, because data in such spaces often lack regular structure and the curse of dimensionality can also adversely impact performance in multiple ways.


In this talk, we present several new types of neural network operations and architectures for 3D and higher-dimensional spaces and demonstrate how we can mitigate these issues while retaining the favorable properties of standard convolution. First, we investigate view-based representations for 3D shape recognition. We show that a collection of 2D views can be highly informative, and we can adapt standard 2D DNNs with a simple pooling strategy to recognize objects based on their appearances from multiple viewpoints with unprecedented accuracy. Next, we make a connection between 3D point cloud processing and sparse high-dimensional filtering. The resulting representation is highly efficient and flexible, and allows native 3D operations as well as joint 2D-3D reasoning. Finally, we show that high-dimensional filtering is also a powerful tool for content-adaptive image filtering and demonstrate different scenarios where DNNs can incorporate such operations for computer vision applications, including joint upsampling and semantic segmentation.
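The "simple pooling strategy" for view-based recognition mentioned above can be sketched as follows. The module and argument names are illustrative, and the backbone CNN is assumed to return one feature vector per image; this is a generic multi-view pooling sketch, not the speaker's code.

```python
import torch
import torch.nn as nn

class MultiViewPool(nn.Module):
    """Run a shared 2D CNN on each rendered view of a 3D shape and max-pool
    the per-view features before classifying."""
    def __init__(self, cnn, feat_dim, num_classes):
        super().__init__()
        self.cnn = cnn                              # any image -> feature network
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, views):                       # views: (B, V, C, H, W)
        b, v = views.shape[:2]
        feats = self.cnn(views.flatten(0, 1))       # (B*V, feat_dim)
        feats = feats.view(b, v, -1).max(dim=1).values   # pool across the V views
        return self.classifier(feats)               # (B, num_classes)
```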


Bio:

Hang Su is a PhD student in the Computer Vision Lab at UMass Amherst, advised by Prof. Erik Learned-Miller and Prof. Subhransu Maji. He works in the areas of computer vision and graphics with a focus to bring together the strengths of 2D and 3D visual information for learning richer and more flexible representations. He obtained his master's degree from Brown University and his bachelor's degree from Peking University. During his studies, he enjoyed internships at Nvidia Research, Microsoft Research, eHarmony, and the Chinese Academy of Sciences. He is a recipient of a CVPR Best Paper Honorable Mention Award and an NVAIL Pioneering Research Award. 




Title: Emergence of Structure in Neural Network Learning

Speaker: Brian Cheung

Time: 10/10/19 - 10am

Location: 32-D507 


Abstract:

Learning is one of the hallmarks of human intelligence. It marks a level of flexibility and adaptation to new information that no artificial model has achieved at this point. This remarkable ability to learn makes it possible to accomplish a multitude of cognitive tasks without requiring a multitude of information from any single task. We describe emergent phenomena that occur during learning for neural network models. First, we observe how learning well-defined tasks can lead to the emergence of structured representations. This emergent structure appears at multiple levels within these models. From semantic factors of variation appearing in the hidden units of an autoencoder to physical structure appearing at the sensory input of an attention model, learning appears to influence all parts of a model. With this in mind, we develop a new method to guide this learning process for acquiring multiple tasks within a single model. Such methods will endow neural networks with greater flexibility to adapt to new environments without sacrificing the emergent structures which have been acquired previously from learning.


Bio:

Brian Cheung is a visiting postdoctoral researcher at MIT hosted by Pulkit Agrawal. He recently received his PhD from UC Berkeley while working at the Redwood Center for Theoretical Neuroscience with Bruno Olshausen. His interests center on developing learning algorithms that enable machines to adapt and behave more like humans. His current work focuses on developing learning algorithms that can adapt to continually changing environments.




Date: Friday 10/04. 4.30pm - 5.30pm

--------------------------

Title: People Watching

Speaker: Jitendra Malik


Abstract: 

In this mini-talk I will present recent results on some classic problems in perceiving people – predicting 3D human dynamics, understanding conversational gesture accompanying speech, and classifying human actions in video clips. We can even use imitation learning as a basis for skill acquisition from video. However, much remains to be done, and I will list some of the key open problems.


Bio:

Jitendra Malik is Arthur J. Chick Professor in the Department of Electrical Engineering and Computer Science at the University of California at Berkeley, where he also holds appointments in vision science, cognitive science and Bioengineering. He received the PhD degree in Computer Science from Stanford University in 1985 following which he joined UC Berkeley as a faculty member. He served as Chair of the Computer Science Division during 2002-2006, and of the Department of EECS during 2004-2006.

Jitendra's group has worked on computer vision, computational modeling of biological vision, computer graphics and machine learning. Several well-known concepts and algorithms arose in this work, such as anisotropic diffusion, normalized cuts, high dynamic range imaging and shape contexts. He was awarded the Longuet-Higgins Award for “A Contribution that has Stood the Test of Time” twice, in 2007 and 2008, received the PAMI Distinguished Researcher Award in computer vision in 2013, the K.S. Fu prize in 2014, and the IEEE PAMI Helmholtz prize for two different papers in 2015.


Jitendra Malik is a Fellow of the IEEE, ACM, and the American Academy of Arts and Sciences, and a member of the National Academy of Sciences and the National Academy of Engineering.



--------------------------

Title: Overcoming Mode Collapse and the Curse of Dimensionality

Speaker: Ke Li


Abstract:

In this talk, I will present our work on overcoming two long-standing problems in machine learning and algorithms:


1. Mode collapse in generative adversarial nets (GANs)


Generative adversarial nets (GANs) are perhaps the most popular class of generative models in use today. Unfortunately, they suffer from the well-documented problem of mode collapse, which the many successive variants of GANs have failed to overcome. I will illustrate why mode collapse happens fundamentally and show a simple way to overcome it, which is the basis of a new method known as Implicit Maximum Likelihood Estimation (IMLE). 
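At a high level, IMLE avoids dropping modes by ensuring every data example has a nearby generated sample. The sketch below shows one naive training step in that spirit; the `generator.latent_dim` attribute, the flattened-vector data layout, and the brute-force nearest-neighbour search are illustrative assumptions (the actual method relies on fast nearest-neighbour search such as DCI), and this should not be read as the speaker's implementation.

```python
import torch

def imle_step(generator, data_batch, optimizer, num_samples=32):
    """One IMLE-style update: for each real example, find the closest generated
    sample and pull it toward that example, so no data point is left uncovered.
    Assumes `generator(z)` and `data_batch` are (batch, D) flat vectors."""
    z = torch.randn(num_samples, generator.latent_dim)
    fake = generator(z)                                    # (S, D) generated samples
    dists = torch.cdist(data_batch, fake)                  # (B, S) pairwise distances
    nearest = dists.argmin(dim=1)                          # closest sample per datum
    loss = ((fake[nearest] - data_batch) ** 2).mean()      # pull matches to the data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```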


2. Curse of dimensionality in exact nearest neighbour search


Efficient algorithms for exact nearest neighbour search developed over the past 40 years do not work in high (intrinsic) dimensions, due to the curse of dimensionality. It turns out that this problem is not insurmountable - I will explain how the curse of dimensionality arises and show a simple way to overcome it, which gives rise to a new family of algorithms known as Dynamic Continuous Indexing (DCI).


Bio: 

Ke Li is a recent Ph.D. graduate from UC Berkeley, where he was advised by Prof. Jitendra Malik, and will join Google as a Research Scientist and the Institute for Advanced Study (IAS) as a Member hosted by Prof. Sanjeev Arora. He is interested in a broad range of topics in machine learning, computer vision, NLP and algorithms and has worked on generative modelling, nearest neighbour search and Learning to Optimize. He is particularly passionate about tackling long-standing fundamental problems that cannot be tackled with a straightforward application of conventional techniques. He received his Hon. B.Sc. in Computer Science from the University of Toronto in 2014. 





Toward Holistic Scene Understanding: Integrating Study of Low-level and High-level Vision Problems

Speaker: Huaizu Jiang

Time: Tuesday, September 24th, 12pm

Location: 32-D507


Abstract:

When we perceive our visual world, we humans have an amazing ability to instantly infer a set of properties of the scene, ranging from low-level properties such as motion, depth, and occlusion to high-level ones such as object locations and semantic labels. In this talk, I'll introduce two of our recent papers on visual scene understanding that integrate the study of low-level and high-level vision problems.


In the first part, I'll introduce a self-supervised learning approach that provides an alternative to ImageNet pre-training for learning useful feature representations. Specifically, we take two consecutive video frames and estimate the optical flow between them, from which we recover the camera motion and thus the relative depth of the scene. The estimated relative depth is used as supervision to train a network using the first video frame only. The learned feature representations can be transferred to downstream tasks like semantic segmentation and object detection, leading to better performance than training from scratch. I'll also show that such pre-training can be used as a way of dealing with domain adaptation. 


In the second part, I'll present an end-to-end trainable network for jointly estimating optical flow, stereo disparity, occlusions, and semantic segmentation of a scene. We design an efficient and compact model with a single shared encoder and keep different decoders for different tasks. Since not all ground-truth annotations are available at the same time for different tasks, we also design semi-supervised loss terms, including distillation loss and self-supervised loss, to utilize partially labeled data. Based on estimated optical flow, stereo disparity, and semantic segmentation, we can estimate 3D motion of the scene (known as scene flow). 
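To make the "one shared encoder, separate light decoders" design concrete, here is a toy PyTorch sketch. The layer sizes, the single-image input, and the 19-class default are illustrative assumptions; the actual network operates on frame and stereo pairs and is considerably deeper.

import torch
import torch.nn as nn

class SharedEncoderMultiHead(nn.Module):
    # One shared feature encoder, one lightweight decoder head per task.
    def __init__(self, n_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.flow_head = nn.Conv2d(64, 2, 3, padding=1)         # optical flow (u, v)
        self.disp_head = nn.Conv2d(64, 1, 3, padding=1)         # stereo disparity
        self.occ_head = nn.Conv2d(64, 1, 3, padding=1)          # occlusion logits
        self.seg_head = nn.Conv2d(64, n_classes, 3, padding=1)  # semantic segmentation

    def forward(self, x):
        f = self.encoder(x)
        return {"flow": self.flow_head(f), "disparity": self.disp_head(f),
                "occlusion": self.occ_head(f), "segmentation": self.seg_head(f)}

With partially labeled data, each task's supervised loss is simply evaluated only on the samples that carry the corresponding ground truth, alongside the distillation and self-supervised terms described above.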


Bio:

Huaizu Jiang is a fifth-year PhD student at UMass Amherst, advised by Erik Learned-Miller. He has broad interests in computer vision, computational photography, natural language processing, and machine learning. His long-term research goal is to advance visual intelligence by utilizing massive unlabeled visual data, mining knowledge written in the form of text, and developing an intelligent agent from its interactions with the environment and other peers. He is also passionate about building smart tools to help people more easily record and share life experiences. He received the Adobe Fellowship and the NVIDIA Graduate Fellowship in 2019.



Hybrid Bayesian Eigenobjects - Toward Unified 3D Robot Perception

Speaker: Ben Burchfiel

Date: Tuesday, March 24, 2019. 12-1pm

Location: 32-D507


Abstract:

Hybrid Bayesian Eigenobjects are a novel representation for 3D objects that leverage both convolutional (deep) inference and linear subspace methods to enable robust reasoning about novel 3D objects. HBEOs allow joint estimation of the pose, class, and full 3D geometry of a novel object observed from a single (depth-image) viewpoint in a unified practical framework. By combining both linear subspace methods and deep convolutional prediction, HBEOs offer improved runtime, data efficiency, and performance compared to preceding purely deep or purely linear methods. In this talk, I discuss the current state of 3D object perception, HBEOs (and their predecessor BEOs), and the path forward towards reliable perception in cluttered and fully unstructured environments.


BIO:

Benjamin Burchfiel is a 6th-year PhD candidate at Duke University in the Intelligent Robot Lab (IRL), supervised by Professor George Konidaris. Benjamin's primary area of research lies at the intersection of computer vision and robotics; his thesis work develops a unified framework to enable more robust object-centric robot perception and reasoning. In addition to robot perception, Benjamin also has interest and research experience in reinforcement learning, learning from demonstration, skill transfer, and natural language understanding. Benjamin received his BSc in Computer Science from the University of Wisconsin-Madison and his MSc in Computer Science from Duke University. Benjamin's webpage can be found at benburchfiel.com.




Human-Centered Autonomous Vehicles

Speaker: Lex Fridman

Date: Tuesday, Nov. 6 2018. 

Time: 12-1pm

Location: 32-D507


Abstract:

I will present a human-centered paradigm for building autonomous vehicle systems, contrasting it with how the problem is currently formulated and approached in academia and industry. The talk will include discussion and video demonstration of new work on driver state sensing, voice-based transfer of control, annotation of large-scale naturalistic driving data, and the challenges of building and testing a human-centered autonomous vehicle at MIT.

Bio:

Lex Fridman is a research scientist at MIT, working on deep learning approaches to perception, control, and planning in the context of semi-autonomous vehicles and more generally human-centered artificial intelligence systems. His work focuses on learning-based methods that leverage large-scale, real-world data. Lex received his BS, MS, and PhD from Drexel University where he worked on applications of machine learning, computer vision, and decision fusion techniques in a number of fields including robotics, active authentication, and activity recognition. Before joining MIT, Lex was at Google leading deep learning efforts for large-scale behavior-based authentication. Lex is a recipient of a CHI-17 best paper award and a CHI-18 best paper honorable mention award. 



Visual Question Answering and Beyond

Speaker: Aishwarya Agrawal

Date: Thursday, Nov. 1, 2018

Time: 12-1pm

Location: 32-D507

Abstract:

In this talk, I will present our work on Visual Question Answering (VQA) -- I will provide a brief overview of the VQA task, dataset and baseline models, highlight some of the problems with existing VQA models, and talk about our work on fixing some of these problems by proposing -- 1) a new evaluation protocol, 2) a new model architecture, and 3) a novel objective function.

Towards the end of the talk, I will also present some very recent work towards building agents that can generate diverse programs for scenes when conditioned on instructions and trained using reinforced adversarial learning. 

Bio:

Aishwarya Agrawal is a fifth-year Ph.D. student in the School of Interactive Computing at Georgia Tech, working with Dhruv Batra and Devi Parikh. Her research interests lie at the intersection of computer vision, machine learning and natural language processing. The Visual Question Answering (VQA) work by Aishwarya and her colleagues has witnessed tremendous interest in a short period of time (3 years). Aishwarya is a recipient of the NVIDIA Graduate Fellowship 2018-2019, one of the Rising Stars in EECS 2018, and a finalist for the Foley Scholars Award 2018 and the Microsoft and Adobe Research Fellowships 2017-2018. As a research intern, Aishwarya has spent time at Google DeepMind, Microsoft Research and the Allen Institute for Artificial Intelligence. Aishwarya received her bachelor's degree in Electrical Engineering with a minor in Computer Science and Engineering from Indian Institute of Technology (IIT) Gandhinagar in 2014.


Perceiving and imitating 3D humans in the wild

Speaker: Angjoo Kanazawa

Date: Tuesday, October 30, 2018

Time: 12-1pm

Location: 32-D507


Abstract: 

Perceiving humans in 3D from everyday photos and videos has been a long-standing goal in computer vision. Such systems have the potential to enable embodied agents that learn from observing people and marker-less motion capture for AR/VR, entertainment, and medical applications.

In this talk I will discuss challenges in perceiving 3D humans from imagery captured `in-the-wild` obtained from unconstrained everyday photos and videos of people, and discuss our recent approach that uses an adversarial framework to overcome these challenges. Given a single RGB image, our approach, called Human Mesh Recovery (HMR), recovers a 3D mesh of a human body parametrized by pose (3D joint rotations) and shape in real-time. Recovering such a rich 3D representation enables a variety of applications, one of which is imitation learning from internet videos. In particular, I will show how we can extend HMR to train a physically simulated agent to perform a broad range of dynamic skills, such as locomotion, acrobatics, and martial arts from watching YouTube video clips.

In the remaining time, I will discuss 3D reconstruction of objects other than human bodies, where there may be no large-scale collection of 3D data to learn from. 

Bio: 

Angjoo Kanazawa is a postdoctoral researcher at UC Berkeley advised by Jitendra Malik, Alyosha Efros, and Trevor Darrell. Her research is at the intersection of computer vision, graphics, and machine learning, focusing on 3D reconstruction of deformable shapes such as humans and animals from monocular imagery. She received her PhD in Computer Science from the University of Maryland, College Park, where she was advised by David Jacobs, and her Bachelor’s from New York University. She has also closely collaborated with Michael Black at the Max Planck Institute for Intelligent Systems. Her work received the best paper award at Eurographics 2016. 



Beautiful Science

Speaker: Mauro Martino

Time:  11:30 am to 12:30 pm

Date: Tuesday, Nov.7, 2017

Location: 32-D507

Abstract:

Science communication is experiencing a renaissance, thanks to increasingly advanced visualization and interaction techniques. In this talk, we will explore the variety of formats that science communication can take, from the freehand drawings of the first scientists to the most modern techniques of “Augmented Reading”. We will also explore some of the most well-known “Data-Movie” experiments.

Bio:

Mauro Martino is an Italian expert in data visualization based in Boston. He created and leads the Visual AI Lab at IBM Watson in Cambridge, Massachusetts. Martino's data visualizations have been published in the scientific journals Nature, Science, and PNAS. His projects have been shown at international festivals including Ars Electronica, and at art galleries including The Serpentine Gallery (London), GAFTA (San Francisco), and Lincoln Center (New York). He is the winner of the best scientific video of 2017 at the National Science Foundation Viz competition. His website is at www.mamartino.com


Talk: Learning a Driving Model from Imperfect Demonstrations

Speaker: Huazhe (Harry) Xu

Time: 11:30 am to 12:30 pm

Date: Tuesday, Oct. 31, 2017

Abstract:

Robust real-world learning should benefit from both demonstrations and interaction with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on reward from the environment. These tasks have divergent losses which are difficult to jointly optimize; further, such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstration and refines the policy in a real environment. Crucially, both learning from demonstration and interactive refinement use exactly the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
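The normalization idea can be illustrated with a small sketch: if Q(s, .)/alpha is read as the logits of a policy, then maximizing the likelihood of demonstrated actions trains Q relative to the soft value V(s) = alpha * logsumexp(Q(s, .)/alpha), which implicitly pushes down the Q-values of actions unseen in the demonstrations. This captures only the intuition behind the normalization, not the full NAC actor-critic update.

import torch
import torch.nn.functional as F

def normalized_demo_loss(q_net, states, demo_actions, alpha=0.1):
    # Q-values act as policy logits; log_softmax subtracts the soft value
    # V(s), so raising Q for demonstrated actions lowers the relative Q of
    # unseen actions.  (Illustrative sketch, not the NAC algorithm itself.)
    q = q_net(states)                              # (batch, n_actions)
    log_pi = F.log_softmax(q / alpha, dim=1)
    return -log_pi.gather(1, demo_actions.unsqueeze(1)).mean()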

Bio: 

Huazhe (Harry) Xu is a Ph.D. student under Prof. Trevor Darrell in Berkeley Artificial Intelligence Research Lab (BAIR) at University of California, Berkeley (UC Berkeley). He received the B.S.E. degree in Electrical Engineering from Tsinghua University in 2016. His research focuses on computer vision, reinforcement learning and their applications such as autonomous driving.


Title: Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks

Speaker: Amir Arsalan Soltani 

Time: 11:30 am to 12:30 pm

Date: Tuesday, Oct.17, 2017

Abstract:

Agents can take advantage of explicit 3D representations for object manipulation and to handle challenging vision tasks, such as learning about the relationships of objects and how to interact with them, much more efficiently. Building computational models that can obtain 3D representations in an efficient manner and generate 3D shapes with high resolution and detail is a first step towards this goal. However, there are important technical challenges to address so that such 3D representations can be effectively deployed as the perception systems of artificially intelligent systems. In this talk I will present our recently published work on building generative models of generic 3D shapes via multi-view representations. Our work goes beyond the state-of-the-art in key respects. I will show our results on generating novel shapes randomly and class-conditionally, obtaining 3D reconstructions given a 2D view of an object in real-world settings, and analysis of the generated shapes and reconstructions. I will end my talk by discussing the role of 3D representations for robotics and object manipulation, and will try to establish the connections between good generative models for 3D shapes and 3D representations in particular and solving inverse problems in vision and planning.

Bio:

Amir is currently a research assistant in Josh Tenenbaum's lab. He obtained his M.Sc. in Computer Science from the University at Buffalo in 2016 and his B.Sc. in Software Engineering from IAUN in Iran in 2012. He is interested in doing research on building computational models for perception that learn inverse models of the environment and enable AI agents to plan for their goals more efficiently.


Computational Perception of Infographics

Speaker: Zoya Bylinskii

Time:  11:30 am to 12:30 pm

Date: Tuesday, September 19, 2017

Location: 32-D507

Abstract: 

The goal of my research is to use computer vision tools to decompose, and provide a computational understanding of, infographics. Infographics (including data visualizations and graphic designs) commonly appear in news media and social networks, textbooks and business meetings. They contain a mix of pictographic, text, and stylistic elements, specifically tailored by human designers to communicate concepts or convey messages. I study how humans remember, attend to, and describe infographics in order to build automatic models that can parse and summarize the visual and textual content. I will present recent work where we crowdsourced human attention online and trained computational models to make real-time predictions about the most important regions in infographics. We use these importance predictions for multiple automatic design applications: thumbnailing and retargeting of designs, and to provide interactive feedback in a graphic design tool. I will also discuss how we are training new models to detect visual elements in infographics representative of specific topics, and how we jointly reason about the visual and text content of infographics to make inferences about the concepts or messages that they convey.

Biography: 

Zoya completed her Hon. B.Sc. in Computer Science and Statistics at the University of Toronto. She is currently a senior Ph.D. student at CSAIL, working with Fredo Durand and Aude Oliva. She actively collaborates with the visualization group of Hanspeter Pfister at Harvard. Zoya interned with Aaron Hertzmann and Bryan Russell at Adobe Research, and was an Adobe Research Fellowship recipient in 2016. Her work lies at the interface of human and computer vision, with applications to design, and human-computer interfaces. More: http://web.mit.edu/zoya/www/research.html


Person Search: A New Research Paradigm

Speaker: Shuang Li from CUHK

Time:  11:00 am to 12:00 pm

Date: Tuesday, September 12, 2017

Location: 32-D463

Abstract:

Automatic person search plays a key role in finding missing people and criminal suspects. However, existing methods are based on manually cropped person images, which are unavailable in the real world. Also, there might be only verbal descriptions of suspects’ appearance in many criminal cases. To improve the practicability of person search in real world applications, we propose two new branches: (i) finding a target person in the gallery of whole scene images and (ii) using natural language description to search people.

In this talk, I will first present a joint pedestrian detection and identification network for person search from whole scene images. An Online Instance Matching (OIM) loss function is proposed to train the network, which is scalable to datasets with numerous identities. Then, I will talk about natural language based person search. A two-stage framework is proposed to solve this problem. The stage-1 network learns to embed textual and visual features with a Cross-Modal Cross-Entropy (CMCE) loss, while the stage-2 network refines the matching results with a latent co-attention mechanism. In stage-2, the spatial attention relates each word with corresponding image regions, while the latent semantic attention aligns different sentence structures to make the matching results more robust to sentence structure variations. The proposed methods produce state-of-the-art results for person search.
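A minimal sketch of an OIM-style loss is below. It assumes a lookup table with one feature vector per labeled identity, updated with a running average after each batch; the circular queue of unlabeled identities described in the work is omitted for brevity, and the temperature and momentum values are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OIMLossSketch(nn.Module):
    def __init__(self, n_ids, feat_dim, temperature=0.1, momentum=0.5):
        super().__init__()
        # One (non-learned) feature slot per labeled identity.
        self.register_buffer("lut", torch.zeros(n_ids, feat_dim))
        self.t = temperature
        self.m = momentum

    def forward(self, feats, labels):
        feats = F.normalize(feats, dim=1)
        # Softmax over cosine similarities to every stored identity.
        logits = feats @ self.lut.t() / self.t
        loss = F.cross_entropy(logits, labels)
        with torch.no_grad():
            # Momentum update of the matched lookup-table entries.
            self.lut[labels] = F.normalize(
                self.m * self.lut[labels] + (1 - self.m) * feats, dim=1)
        return loss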

Bio:

Shuang Li is an M.Phil student at the Chinese University of Hong Kong, advised by Prof. Xiaogang Wang. She works in Multimedia Lab with Prof. Xiaoou Tang. Her research interests include computer vision, natural language processing, and deep learning, especially image-text relationship and person re-identification. She was a research intern at Disney Research, Pittsburgh.


Title: Semantic 3D Reconstruction of Urban Scenes

Speaker: Maros Blaha from ETH Zurich

Time:  11:00 am to 12:00 pm

Date: Tuesday, June 13, 2017

Location: 32-D507

Abstract:

Virtual models of human habitats play a key role in daily routines, as witnessed by many influential applications such as navigation and internet cartography. To this day, the generation of urban 3D models often entails human interaction, which is costly and time-consuming. In this talk, I will give an overview of our latest research on automatic generation of semantically annotated 3D city models from image collections. In this context, our goal is to recover the geometry of an observed scene while at the same time also interpreting the scene in terms of semantic object classes (e.g., buildings, vegetation etc.) - similar to a human operator, who also interprets the image content while making measurements. The advantage of jointly reasoning about shape and object class is that one can exploit class-specific a-priori knowledge about the geometry: on the one hand the type of object provides information about its shape, e.g. walls are likely to be vertical, whereas streets are not; on the other hand, 3D geometry is also an important cue for classification, e.g. in our example vertical surfaces are more likely to be walls than streets. In our research, we address this cross-fertilization by developing methods which jointly infer 3D shapes and semantic classes, leading to superior, interpreted 3D city models which allow for realistic applications and advanced reasoning tasks.

Bio:

Maros Blaha is a PhD student and research assistant at the Photogrammetry and Remote Sensing Lab of ETH Zurich, advised by Konrad Schindler. During his PhD studies, he was a visiting scholar in John Fisher's group at CSAIL, MIT. Previously, he obtained his MSc (Master of Science) and BSc (Bachelor of Science) in Geomatics Engineering and Planning at ETH Zurich. During his MSc, he spent one year in industry, at Hexagon/Leica Geosystems. His research interests include joint 3D modeling and scene understanding, as well as 3D reasoning.


Title: A picture of the energy landscape of deep neural networks

Speaker: Pratik Chaudhari from UCLA

Time:  4:00 pm to 5:00 pm

Date: Monday, June 12, 2017

Location: 32-D507

Abstract: 

Stochastic gradient descent (SGD) is the gold standard of optimization in deep learning. It does not, however, exploit the special structure and geometry of the loss functions we wish to optimize, viz. those of deep neural networks. In this talk, we will focus on the geometry of the energy landscape at local minima with an aim of understanding the generalization properties of deep networks.

In practice, optima discovered by SGD have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We will first leverage upon this observation to construct an algorithm named Entropy-SGD that maximizes a local version of the free energy. Such a loss function favors flat regions of the energy landscape which are robust to perturbations and hence more generalizable, while simultaneously avoiding sharp, poorly-generalizable --- although possibly deep --- valleys. We will discuss connections of this algorithm with belief propagation and robust ensemble learning. Furthermore, we will establish a tight connection between such non-convex optimization algorithms and nonlinear partial differential equations. Empirical validation on CNNs and RNNs shows that Entropy-SGD and related algorithms compare favorably to state-of-the-art techniques in terms of both generalization error and training time.
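Below is a minimal single-tensor sketch of an Entropy-SGD-style update, as I read the papers linked after this paragraph: an inner SGLD loop explores the neighbourhood of the current weights, a running average estimates the mean of the local Gibbs distribution, and the outer step moves the weights toward that mean, i.e. toward wide, flat regions. The hyperparameters and the single-parameter-tensor setup are simplifying assumptions, not the reference implementation.

import torch

def entropy_sgd_step(params, loss_fn, gamma=1e-4, eta_inner=0.1, eta_outer=1.0,
                     n_inner=20, noise=1e-4, alpha=0.75):
    # params: one weight tensor; loss_fn(w) must return a scalar loss built
    # from w (both are simplifying assumptions for this sketch).
    x = params.detach().clone()
    xp = x.clone()
    mu = x.clone()
    for _ in range(n_inner):
        xp = xp.detach().requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(xp), xp)[0]
        with torch.no_grad():
            # SGLD step sampling the local Gibbs distribution around x ...
            xp = xp - eta_inner * (grad - gamma * (x - xp)) \
                 + noise * (eta_inner ** 0.5) * torch.randn_like(xp)
            # ... and a running estimate of its mean.
            mu = (1.0 - alpha) * mu + alpha * xp
    with torch.no_grad():
        # Outer step: move the weights toward the local Gibbs mean,
        # i.e. along the (negative) local-entropy gradient gamma * (x - mu).
        params.copy_(x - eta_outer * gamma * (x - mu))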

arXiv: https://arxiv.org/abs/1611.01838, https://arxiv.org/abs/1704.04932

Bio: 

Pratik Chaudhari is a PhD candidate in Computer Science at UCLA. With his advisor Stefano Soatto, he focuses on optimization algorithms for deep networks. He holds Master's and Engineer's degrees in Aeronautics and Astronautics from MIT where he worked on stochastic estimation and randomized motion planning algorithms for urban autonomous driving with Emilio Frazzoli.


Title: Domain Adaptation: from Manifold Learning to Deep Learning

Speaker: Andreas Savakis from Rochester Institute of Technology

Time:  11:00 am to 12:00 pm

Date: Tuesday, June 6, 2017

Location: 32-D463

Abstract: 

Domain Adaptation (DA) aims to adapt a classification engine from a train (source) dataset to a test (target) dataset.  The goal is to remedy the loss in classification performance due to the dataset bias attributed to variations across test/train datasets. This seminar presents an overview of domain adaptation methods from manifold learning to deep learning.  Popular DA methods on Grassmann manifolds include Geodesic Subspace Sampling (GSS) and Geodesic Flow Kernel (GFK). Grassmann learning facilitates compact characterization by generating linear subspaces and representing them as points on the manifold. I will discuss robust versions of these methods that combine L1-PCA and Grassmann manifolds to improve DA performance across datasets.

Deep domain adaptation has received significant attention recently. I will present a new domain adaptation approach for deep learning that utilizes Adaptive Batch Normalization to produce a common feature-space between domains. Our method then performs label transfer based on subspace alignment and k-means clustering on the feature manifold to transfer labels from the closest source cluster to each target cluster.  The proposed manifold-guided label transfer method produces state-of-the-art results for deep adaptation on digit recognition datasets.
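The Adaptive Batch Normalization step itself is simple to sketch in PyTorch: keep all learned weights fixed and re-estimate the BatchNorm running statistics on unlabeled target-domain batches. The subsequent subspace-alignment and k-means label transfer are not shown, and the model and data loader below are placeholders, so treat this as a sketch of the general AdaBN idea rather than the authors' pipeline.

import torch
import torch.nn as nn

@torch.no_grad()
def adapt_batchnorm_stats(model, target_loader, device="cpu"):
    # Reset BN statistics, then re-estimate them on unlabeled target data.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()
            m.momentum = None          # accumulate a plain average over all batches
    model.train()                      # BN layers update running stats in train mode
    for images, *_ in target_loader:   # loader assumed to yield (images, ...) tuples
        model(images.to(device))
    model.eval()                       # inference now uses target-domain statistics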

Bio:

Andreas Savakis is Professor of Computer Engineering at Rochester Institute of Technology (RIT) and director of the Real Time Vision and Image Processing Lab. He served as department head of Computer Engineering from 2000 to 2011.  He received the B.S. with Highest Honors and M.S. degrees in Electrical Engineering from Old Dominion University in Virginia, and the Ph.D. in Electrical and Computer Engineering with Mathematics Minor from North Carolina State University in Raleigh NC. He was Senior Research Scientist with the Kodak Research Labs before joining RIT. His research interests include domain adaptation, object tracking, expression and activity recognition, change detection, deep learning and computer vision applications. Prof. Savakis has co-authored over 100 publications and holds 11 U.S. patents.  He received the NYSTAR Technology Transfer Award for Economic Impact in 2006, the IEEE Region 1 Award for Outstanding Teaching in 2011, and the best paper award at the International Symposium on Visual Computing (ISVC) in 2013. He is Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology and the Journal for Electronic Imaging.  He co-organized the first Int. Workshop on Extreme Imaging (http://extremeimaging.csail.mit.edu/) at ICCV 2015, and is Guest Editor at the IEEE Transactions on Computational Imaging for a Special Issue on Extreme Imaging.  

Title: Learning to synthesize and manipulate natural images 

Speaker: Jun-Yan Zhu, Berkeley. 

Time:  2:00 pm to 3:00 pm 

Date: Friday, May 26, 2017 

Location: 32-D463 

Abstract: 

Humans are consumers of visual content. Every day, people watch videos, play digital games and share photos on social media. But there is still an asymmetry - not that many of us are creators. In this talk, we aim to build machines capable of creating and manipulating natural photographs, and to use them as training wheels for visual content creation, with the goal of making people more visually literate. We propose to learn natural image statistics directly from large-scale data. We then define a class of image generation and editing operations, and constrain their output to look realistic according to the learned image statistics. 

I will discuss a few recent projects. First, we propose to directly model the natural image manifold via generative adversarial networks (GANs), and constrain the output of an image editing tool to lie on this manifold. Then, we present a general image-to-image translation framework, “pix2pix”, where a network is trained to map input images (such as user sketches) directly to natural looking results. Finally, we introduce CycleGAN, which learns image-to-image translation models even in the absence of paired training data, and additionally demonstrate its application to bridging the gap between synthetic 3D data and real images. 
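The cycle-consistency term at the heart of CycleGAN is compact enough to sketch. Here G maps domain X to Y and F_net maps Y back to X; the full objective additionally includes adversarial losses for both mapping directions, and the weight lam below is illustrative.

import torch.nn.functional as F

def cycle_consistency_loss(G, F_net, real_x, real_y, lam=10.0):
    rec_x = F_net(G(real_x))      # x -> G(x) -> F(G(x)) should return to x
    rec_y = G(F_net(real_y))      # y -> F(y) -> G(F(y)) should return to y
    return lam * (F.l1_loss(rec_x, real_x) + F.l1_loss(rec_y, real_y))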

Bio: 

Jun-Yan Zhu is a Ph.D. candidate at the Berkeley AI Research (BAIR) Lab, working on computer vision, graphics and machine learning with Professor Alexei A. Efros. He received his B.E. from Tsinghua University in 2012, and was a Ph.D. student at CMU from 2012-13. His research goal is to build machines capable of recreating the visual world. Jun-Yan is currently supported by a Facebook Fellowship. 


Title: Studying detailed image interpretation from the limit of visual recognition: full interpretation of minimal images

Speaker: Guy Ben-Yosef, MIT CSAIL and Center for Brains, Minds, and Machines

Time:  11:00 am to 12:00 pm

Date: Tuesday, May 23, 2017

Location: 32-D463 (star)

Abstract: 

In this talk I'll describe a model for ‘full interpretation’ of object images, namely the ability to identify and localize all semantic features and parts that are recognized by human observers. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of 'minimal configurations’, namely reduced local regions that are minimal in the sense that further reduction will render them unrecognizable and uninterpretable. I'll show experimental results of our model, and discuss approaches to interpretation modeling for recognizing human activities and interactions. Specifically, I'll describe a recent model for recognizing social interactions in still images, based on detailed interpretation of minimal 'interaction configurations'. Joint work with Liav Assif, Alon Yachin, Daniel Harari, and Shimon Ullman.

Bio:

Guy Ben-Yosef is a postdoctoral associate at MIT CSAIL and MIT Center for Brains, Minds, and Machines. He is interested in human and computer vision, with a focus on computational models for visual perceptual organization, visual recognition, and image interpretation. 


Title: Population Based Medical Image Imputation

Speaker: Adrian Dalca, MIT

Time:  11:00 am to 12:00 pm

Date: Tuesday, May 9, 2017

Location: 32-D463

Abstract: 

We present an algorithm for creating high resolution anatomically plausible medical images consistent with acquired clinical brain MRI scans with large slice spacing. Although large databases of clinical images contain a wealth of information, medical acquisition constraints result in sparse scans that miss much of the anatomy. These characteristics often render computational analysis impractical as standard processing algorithms tend to fail when applied to such images. Highly specialized or application-specific algorithms that explicitly handle sparse slice spacing do not generalize well across problem domains. In contrast, our goal is to enable application of existing algorithms that were originally developed for high resolution research scans to dramatically undersampled scans. We introduce a model that captures fine-scale anatomical similarity across subjects in clinical image collections without the need of external high-resolution scans, and use it to fill in the missing data in scans with large slice spacing. Our experimental results demonstrate that the proposed method outperforms current upsampling methods and promises to facilitate subsequent analysis not previously possible with scans of this quality.


The work will be presented at IPMI: Information Processing in Medical Imaging 2017 as: A.V. Dalca, K.L. Bouman, W.T. Freeman, M.R. Sabuncu, N.S. Rost, P. Golland. Population Based Image Imputation.


Short Bio: 

Adrian is a postdoctoral fellow at Massachusetts General Hospital and Harvard Medical School, with an appointment at MIT. He obtained his PhD from MIT in CSAIL working with Polina Golland.  He is interested in mathematical models and machine learning for medical image analysis, with a focus on characterizing genetic and clinical effects on imaging phenotypes. He is also interested and active in healthcare entrepreneurship and translation of algorithms to the clinic.


Title: Interactive Scene Understanding

Speaker: Roozbeh Mottaghi from Allen Institute for Artificial Intelligence (AI2)

Time:  11:00 am to 12:00 pm

Date: Tuesday, May 2, 2017

Location: 32-D463

Abstract: 

Despite recent progress, AI is still far from understanding the physics of the world, and there is a large gap between the abilities of humans and the state-of-the-art AI methods. In this talk, I will focus on physics-based scene understanding and interactive visual reasoning, which are crucial next steps in computer vision and AI. The first part of the talk will describe our work on understanding preliminary physics from images, and the second part of the talk will be about our recent work on using Reinforcement Learning and Imitation Learning to perform tasks in the challenging AI2-THOR environment.

Bio: 

Roozbeh Mottaghi is a Research Scientist at Allen Institute for Artificial Intelligence (AI2). Prior to joining AI2, he was a post-doctoral researcher at the Computer Science Department at Stanford University. He obtained his PhD in Computer Science in 2013 from UCLA. His research is mainly focused on computer vision and machine learning. 

Further info at: http://www.cs.stanford.edu/~roozbeh/


Time: 11:00 am to 12:00 pm

Date: Tuesday, April 25, 2017

Location: 32-D463 (Star)

Talk 1: Multi-spectral Infrared Detection of Buildings for City-Scale Thermal Efficiency Analysis

Abstract:

Automatic identification of building structures and components (windows, doors) from vehicle-collected infrared imagery is critical for the efficient commercial analysis of territories with millions of homes. To fully replace or reduce human annotation, an automated system must select images with a well-centered and minimally-obstructed building and then sufficiently classify each facade pixel in the thermal image data so that an accurate thermal leak analysis can be completed. In this talk, we will discuss how Essess combines long-wave (thermal), near infrared (night-vision), and 3D Lidar imagery, including classification and regional detection techniques, to identify and analyze suburban homes. Through the commercial deployment of the presented system, Essess has successfully reduced the human role from significant annotation to simple QA, increasing the throughput of our system ten-fold, ultimately getting us closer to our goal of providing a meaningful solution to reduce energy costs and address climate change at scale.

Bio:

Jan Falkowski is Chief Technology Officer at Essess (www.essess.com), an MIT spin-off which employs vehicle-based 3D and infrared imaging to deliver city-scale thermal maps for building, navigation and utility infrastructure sectors. Previously, Jan held senior management and engineering roles at Cambridge-based Vecna Technologies, where he was responsible for the R&D and productization efforts of the company’s first commercial robotics product. Jan received his bachelor’s and master’s degrees at Carnegie Mellon University, where he studied mechanical engineering and robotics.

Talk 2: Towards Deep Category-Aware Semantic Edge Detection

Abstract: Boundary and edge cues are highly beneficial in improving a wide variety of vision tasks such as semantic segmentation, object recognition, object detection/proposal generation and stereo. Recently, the problem of edge detection has been revisited and significant progress has been made with deep learning. While classical edge detection is a challenging binary classification problem in itself, detecting category-level semantic edges by nature is an even more challenging problem. In addition, the topic presents a core recognition problem that seems to be somewhat under-represented in the literature. In this talk, I will present some latest progress in this area, and give some views/thoughts towards addressing the problem with an end-to-end deep network.

Bio: 

Zhiding Yu is a final year Ph.D. student with the Department of ECE, Carnegie Mellon University. He graduated from the FENG Bingquan Pilot Class, South China University of Technology in 2008 with a B.Eng. degree, and obtained the M.Phil. degree from the Department of ECE, Hong Kong University of Science and Technology in 2012. His main research interests include deep learning, similarity representation, grouping and structured prediction, with their applications to scene parsing, segmentation, object detection/recognition and autonomous driving. He is a co-author of the best student paper in ISCSLP 2014, and the winner of best paper award in WACV 2015. He was twice the recipient of the HKTIIT Post-Graduate Excellence Scholarships (2010/2012). He did several research interns at Adobe Research, Microsoft Research and Mitsubishi Electric Research Laboratories. His intern work on facial expression recognition at Microsoft Research won the First Runner Up at the EmotiW-SFEW Challenge 2015 and was integrated to the Microsoft Emotion API under Project Oxford. For more information, please visit: http://www.contrib.andrew.cmu.edu/~yzhiding/.


Novel Convex Optimization Algorithms for 3D Reconstruction and Image Segmentation

Speaker: Jan Stuhmer

Time: 11:00 am to 12:00 pm

Date: Tuesday, April 11, 2017

Location: 32-D463 (Star)

Abstract:

First I will present my work on real-time 3D reconstruction with a hand-held camera. At the time of publication it was one of the first real-time-capable dense image-based 3D reconstruction methods. Based on the mathematical framework used to compute variational optical flow, a novel formulation of the multi-view 3D reconstruction problem is derived. This formulation allows multiple views of the scene to be combined through rigorous derivations that follow directly from the scene and camera geometry. Real-time performance is achieved by using convex optimization techniques in a multi-resolution approach. In the second part of the talk, I will present an efficient framework for defining topological constraints in image segmentation and 3D reconstruction. Specifically, we will see how connectivity can be imposed as a monotonicity constraint along the connected paths of a predefined graph. While this reformulation of connectivity is not as general as requiring the existence of some connected path, the weaker formulation can be included as a linear constraint in a convex optimization framework. I will describe an efficient projection scheme onto the feasible set, which allows the resulting convex optimization problem to be solved efficiently using a proximal algorithm. Furthermore, I will show that thresholding a minimizer of the relaxed optimization problem yields a minimizer of the discrete problem. The presented framework brings significant improvements over the state-of-the-art in biomedical image segmentation and dynamic 3D reconstruction from video.
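As a rough sketch of the monotonicity idea (my own reading and notation, not necessarily the formulation used in the talk): let u be a relaxed indicator function on the vertices V of the predefined graph, organized as a tree rooted at a seed vertex r with parent map p. Connectivity to the seed can then be encoded with linear constraints,

\min_{u \in [0,1]^{|V|}} E(u)
\quad \text{subject to} \quad
u(v) \le u(p(v)) \qquad \forall v \in V \setminus \{r\},

so that every thresholded set \{u \ge t\} contains, along with each of its vertices, the whole path back to the root, and is therefore connected.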

Bio:

Before joining MIT, Jan finished his Ph.D. thesis at the Computer Vision Group of the Technical University of Munich under the supervision of Prof. Daniel Cremers, co-supervised by Prof. Peter Schröder at Caltech. In 2014/2015 he stayed as a research intern with Microsoft Research Cambridge and in 2013 as a visiting student researcher at the Applied Geometry Lab at Caltech. From 2005 to 2009 he was with the group of Carl-Philipp Heisenberg at the Max Planck Institute of Molecular Cell Biology and Genetics. He received his Diploma degree (with distinction) in Computer Science from Dresden University of Technology.

Physical object representations for perception, prediction, and problem solving

Speaker: Ilker Yildirim from MIT BCS

Time:  11:00 am to 12:00 pm

Date: Tuesday, March 28, 2017

Location: 32-D463

Abstract:

From a quick glance, the touch of an object, or a brief sound snippet, our brains construct scenes composed of rich and detailed shapes and surfaces. These representations underlie not only object and scene recognition, but also support mental imagery, problem solving and action planning. I will argue that to compute such rich representations, the brain draws on internal causal and compositional models of the outside physical world. In this view, perception is building and using such internal models to explain its sense inputs. I implement this approach drawing on a range of methods including probabilistic generative models, neural networks, and forward models such as those found in modern video game engines. I will present computational models of unisensory and multisensory perception, physical scene understanding, and physical problem solving, all of which fall from the same principled approach to perception and cognition. I will show that these models solve difficult engineering tasks, account for the tuning properties of individual neurons in the brain, and predict the human behavioral patterns across a range of tasks. Together, these findings provide computational insights to the mind and brain while paving a way for human-like artificially intelligent systems.

Bio:

Ilker Yildirim is a research scientist at MIT. He mainly works with Josh Tenenbaum (MIT) and Winrich Freiwald (The Rockefeller University). Before that, he did his Ph.D. at the University of Rochester, advised by Robbie Jacobs. His homepage is http://www.mit.edu/~ilkery/.



Perception for manipulation

Speaker: Peter K.T. Yu from MIT MCube Lab

Time:  11:00 am to 12:00 pm

Date: Tuesday, March 21, 2017

Location: 32-D463

Abstract:

Our team (MIT MCube Lab) participated in Amazon Picking Challenge (APC) in 2015 and 2016, and received second and third place, respectively. The challenge was to develop a fully autonomous system to pick and place products in a warehouse setting, where various products may coexist inside a single bin. Robots are still far from achieving human speed and reliability. In the first half of the talk, I will describe our approach to the APC, list the lessons we have learned, and then propose our wish list regarding technologies that will be useful for developing similar systems in the future.

One major item on the wish list is to exploit physics and contact sensing in the perception system, where vision usually plays a big role but is still insufficient in practice. This can be due to occlusions, inaccurate models, ambiguous appearances, etc. I will spend the second half of the talk on our effort to use pushing mechanics and contact sensing to estimate the shape and pose of an object. I will draw connections between our problem and the SLAM (simultaneous localization and mapping) problem and show how we can apply frameworks developed in the SLAM community. Our results show that incorporating this extra information can improve accuracy over vision alone. Moreover, even when vision cues are temporarily missing, our system can still reason about the object state during manipulation.

Bio:

Peter is a fourth-year PhD student in Electrical Engineering and Computer Science. He received his B.S. in Computer Science from National Chiao-Tung University in 2010, and his M.S. in Computer Science from National Taiwan University in 2012. He is now working with Prof. Alberto Rodriguez and Prof. John Leonard in the MCube lab. His research focuses on incorporating contact sensing and physics to improve robot perception. His paper on dataset collection for pushing manipulation was nominated for the Best Paper Award at IROS 2016.

Designing robots to improve people's lives has always been his career goal. He worked on perception problems in the DARPA Robotics Challenge (2013-2015), and served as perception and software lead in the Amazon Picking Challenge (Second Place 2015, Third Place in 2016). His website: http://people.csail.mit.edu/peterkty/


Video Recognition at Adobe

Speaker: Bryan Russell, Adobe Research

Time:  11:00 am to 12:00 pm

Date: Tuesday, March 7, 2017

Location: 32-D463

Abstract: In this talk I will describe ongoing efforts in video recognition at Adobe.  Video presents additional challenges over recognition in still images.  Example challenges include the sheer volume of data, lack of annotated data across time, and presence of action categories where motion and appearance cues are critical.  In the first part, I will describe ActionVLAD, a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of a video.  The resulting architecture is end-to-end trainable for whole-video classification. We show that our representation outperforms a two-stream base architecture by a large margin (13% relative) as well as outperforms other baselines with comparable base architectures on the public HMDB51 video classification benchmark.  In the second part, I will describe ongoing work to localize moments in video with natural language. 
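A NetVLAD-style pooling layer, which is the aggregation underlying ActionVLAD, can be sketched as follows. Here it is applied to a generic bag of local descriptors; ActionVLAD shares the anchor points across the full spatio-temporal extent of the video and sits on top of a two-stream network, and the sizes and initialization below are illustrative rather than the paper's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPoolingSketch(nn.Module):
    def __init__(self, n_clusters, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_clusters, dim))  # anchor points
        self.assign = nn.Linear(dim, n_clusters)                   # soft-assignment logits

    def forward(self, feats):                       # feats: (batch, n_local, dim)
        a = F.softmax(self.assign(feats), dim=-1)   # soft assignment: (batch, n_local, K)
        resid = feats.unsqueeze(2) - self.centers   # residuals: (batch, n_local, K, dim)
        v = (a.unsqueeze(-1) * resid).sum(dim=1)    # aggregated residuals: (batch, K, dim)
        v = F.normalize(v, dim=-1)                  # intra-normalization per cluster
        return F.normalize(v.flatten(1), dim=-1)    # flattened, L2-normalized descriptor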

Bio: Bryan Russell is a Research Scientist at Adobe Research. He received his Ph.D. from MIT in the Computer Science and Artificial Intelligence Laboratory under the supervision of Professors Bill Freeman and Antonio Torralba. He was a post-doctoral fellow in the INRIA Willow team at the Département d'Informatique of Ecole Normale Supérieure in Paris, France. He was a Research Scientist with Intel Labs as part of the Intel Science and Technology Center for Visual Computing (ISTC-VC) and has been affiliated with the University of Washington.


Embodied learning for visual recognition

Speaker: Dinesh Jayaraman, UT Austin

Time: 11:00 am to 12:00 pm

Date: Tuesday, Feb. 28, 2017

Location: 32-D507 (different to previous talks)

Abstract: Visual recognition methods have made great strides in recent years by exploiting large manually curated and labeled datasets specialized to various tasks. My research focuses on asking: could we do better than this painstakingly manually supervised approach? In particular, could embodied visual agents teach themselves through interaction with and experimentation in their environments? 

In this talk, I will present approaches that we have developed to model the learning and performance of visual tasks by agents that have the ability to act and move in their worlds. I will showcase results that indicate that computer vision systems could benefit greatly from action and motion in the world, with continuous self-acquired feedback. In particular, it is possible for embodied visual agents to learn generic image representations from unlabeled video, improve scene and object categorization performance through intelligent exploration, and even learn to direct their cameras to be effective videographers.

Biography: Dinesh Jayaraman is a PhD candidate in Kristen Grauman's group at UT Austin. His research interests are broadly in visual recognition and machine learning. In the last few years, Dinesh has worked on visual learning and active recognition in embodied agents, unsupervised representation learning from unlabeled video, visual attribute prediction, and zero-shot categorization. During his PhD, he has received the Best Application Paper Award at the Asian Conference on Computer Vision 2016 for work on automatic cinematography, the Samsung PhD Fellowship for 2016-17, a UT Austin Microelectronics and Computer Development Fellowship, and a Graduate Dean's Prestigious Fellowship Supplement Award for 2016-17. Before beginning graduate school, Dinesh graduated with a bachelor's degree in electrical engineering from the Indian Institute of Technology Madras (IITM), Chennai, India.


Attention and Activities in First Person Vision

Speaker: Yin Li, Georgia Institute of Technology

Time: 11:00 am to 12:00 pm

Date: Tuesday, Feb. 21, 2017

Location: 32-D463 (Star Room)

Abstract:

Advances in sensor miniaturization, low-power computing, and battery life have enabled the first generation of mainstream wearable cameras. Millions of hours of videos have been captured by these devices, creating a record of our daily visual experiences at an unprecedented scale. This has created a major opportunity to develop new capabilities and products based on First Person Vision (FPV)--the automatic analysis of videos captured from wearable cameras. Meanwhile, vision technology is at a tipping point. Major progress has been made over the last few years in both visual recognition and 3D reconstruction. The stage is set for a grand challenge of activity recognition in FPV. My research focuses on understanding naturalistic daily activities of the camera wearer in FPV to advance both computer vision and mobile health. 

In the first part of this talk, I will demonstrate that first person video has the unique property of encoding the intentions and goals of the camera wearer. I will introduce a set of first person visual cues that captures the users' intent and can be used to predict their point of gaze and the actions they are performing during activities of daily living. Our methods are demonstrated using a benchmark dataset that I helped to create. In the second part, I will describe a novel approach to measure children’s social behaviors during naturalistic face-to-face interactions with an adult partner, who is wearing a camera. I will show that first person video can support fine-grained coding of gaze (differentiating looks to eyes vs. face), which is valuable for autism research. Going further, I will present a method for automatically detecting moments of eye contact. This is joint work with Zhefan Ye, Sarah Edmunds, Dr. Alireza Fathi, Dr. Agata Rozga and Dr. Wendy Stone.

Bio:

Yin Li is currently a doctoral candidate in the School of Interactive Computing at the Georgia Institute of Technology. His research interests lie at the intersection of computer vision and mobile health. Specifically, he creates methods and systems to automatically analyze first person videos, known as First Person Vision (FPV). He has particular interests in recognizing the person's activities and developing FPV for health care applications. He is the co-recipient of the best student paper awards at MobiHealth 2014 and IEEE Face & Gesture 2015. His work had been covered by MIT Tech Review, WIRED UK and New Scientist. Homepage: http://yinli.cvpr.net/


Learning Object Pose for Autonomous Manipulation

Speaker: David (Dave) M.S. Johnson, Draper

Time: 11:00 am to 12:00 pm

Date: Tuesday, Feb. 14, 2017

Location: 32-D463 (Star Room)

Abstract:

The first deployed autonomous systems were aircraft and submarines, as they operate in benign environments with few obstacles. Given the available computational resources, significant effort was required to control platform dynamics and perform rudimentary path planning and obstacle avoidance. Recently, we have seen an explosion of self-driving cars and other autonomous ground vehicles which are able to recognize and avoid obstacles in complex environments, such as city streets. However, these systems still have only limited interaction with their environment – collision avoidance alone. To develop autonomous systems that can not only navigate human environments but also interact with them, we need, in addition to object recognition, robust methods for real-time pose estimation, manipulation, and task planning. I will discuss Draper’s current effort, in collaboration with MIT and Harvard, to develop a system for autonomous mobile manipulation, with a focus on our method for predicting object pose in real-time using a combination of semantic image segmentation and 3D-model fitting to the segmented point cloud.

Bio:

David (Dave) M.S. Johnson runs the Advanced Technology Group at Draper, which investigates novel approaches to autonomous systems. He received his Ph.D. in Physics from Stanford University in 2011, and his B.S. in Physics from Yale University in 2004. Currently, his research focuses on machine learning for object recognition, manipulation, and task planning.


Predictive Vision

Speaker: Carl Vondrick, MIT CSAIL

Time: 11:00 am to 12:00 pm

Date: Tuesday, Feb. 7, 2017

Location: 32-D463 (Star Room)

Abstract: 

Machine learning is revolutionizing our world: computers can recognize images, translate language, and even play games competitively with humans. However, there is a missing piece precluding machine intelligence. My research studies Predictive Vision with the goal of anticipating possible future events. To tackle this challenge, I present predictive vision algorithms that learn directly from large amounts of raw, unlabeled data. Capitalizing on millions of natural videos, my work develops methods for machines to learn to anticipate the visual future, forecast human actions, and recognize ambient sounds. Predictive vision provides a framework for learning from data to simulate events, enabling new applications across health, graphics, and robotics. 

Bio: 

Carl Vondrick is a doctoral student at the Massachusetts Institute of Technology (MIT) where he researches and develops computer vision and machine learning technology. His research was awarded the Google PhD Fellowship, the NSF Graduate Fellowship, and is widely featured in popular press, such as NPR, CNN, the Associated Press, and the Late Show with Stephen Colbert.


Date: 12/13/16

Speakers: Bowen Baker & Otkrist Gupta 

Title: Designing Neural Network Architectures Using Reinforcement Learning


At present, designing convolutional neural network (CNN) architectures requires both human expertise and labor. New architectures are handcrafted by careful experimentation or modified from a handful of existing networks. We propose a meta-modeling approach based on reinforcement learning to automatically generate high-performing CNN architectures for a given learning task. The learning agent is trained to sequentially choose CNN layers using Q-learning with an epsilon-greedy exploration strategy and experience replay. The agent explores a large but finite space of possible architectures and iteratively discovers designs with improved performance on the learning task. On image classification benchmarks, the agent-designed networks (consisting of only standard convolution, pooling, and fully-connected layers) beat existing networks designed with the same layer types and are competitive against the state-of-the-art methods that use more complex layer types. We also outperform existing meta-modeling approaches for network design on image classification tasks.
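To make the search procedure concrete, here is a toy sketch of the sampling loop: states encode the current depth and the previous layer, actions are layer choices, the reward is the validation accuracy of the trained network, and Q-values are refreshed from an experience replay buffer. The tiny layer vocabulary, the terminal-reward-only Q update, and train_and_evaluate are illustrative placeholders, not the paper's exact state-action space or update rule.

import random

LAYERS = ["conv3x3-64", "conv3x3-128", "pool2x2", "fc-256", "terminate"]
Q = {}          # Q[(depth, prev_layer)] -> {action: value}
replay = []     # list of (architecture, reward) pairs

def q_values(state):
    return Q.setdefault(state, {a: 0.5 for a in LAYERS})

def sample_architecture(epsilon, max_depth=6):
    # Epsilon-greedy rollout over layer choices.
    arch, state = [], (0, "start")
    while state[0] < max_depth:
        qs = q_values(state)
        a = random.choice(LAYERS) if random.random() < epsilon \
            else max(qs, key=qs.get)
        arch.append(a)
        if a == "terminate":
            break
        state = (state[0] + 1, a)
    return arch

def update_from_replay(lr=0.1, batch=16):
    # Replay sampled architectures and move Q-values toward their rewards.
    for arch, reward in random.sample(replay, min(batch, len(replay))):
        state = (0, "start")
        for a in arch:
            qs = q_values(state)
            qs[a] += lr * (reward - qs[a])   # simplified terminal-reward update
            state = (state[0] + 1, a)

# Main loop (train_and_evaluate is a hypothetical stand-in for training the
# sampled CNN and returning its validation accuracy):
# for it in range(200):
#     eps = max(0.1, 1.0 - it / 100)
#     arch = sample_architecture(eps)
#     replay.append((arch, train_and_evaluate(arch)))
#     update_from_replay()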


Date: 11/29/16

Speaker: Hao Su

Title: 3D object reconstruction and abstraction by deep learning


Computational methods for 3D perception from single images have been attracting increasing attention recently. In particular, deep neural networks have shown promising ability to learn priors for object shapes from emerging large-scale 3D shape databases. The majority of extant works resort to regular representations such as volumetric grids or collections of images; however, these representations obscure the natural invariance and simple manipulation of 3D shapes under geometric transformations and deformations.


In this talk I will introduce my latest progress on generative networks for 3D geometry based on representations that are unorthodox in the deep learning community, focusing on two tasks:

* High-quality point cloud generation. We build a conditional sampler to predict multiple plausible 3D point clouds from a single input image. The shapes predicted by our algorithm demonstrate significantly better global structure compared with those from volumetric CNNs. (A sketch of a typical point-set training loss follows after this list.)

* 3D shape abstraction by geometric primitives. We present a framework for abstracting complex shapes by learning to assemble objects using 3D geometric primitives such as cuboids. Experiments have shown that our unsupervised shape abstraction method produces results that are quite consistent with human annotations.
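Training a network to emit an unordered point set calls for a permutation-invariant loss; the symmetric Chamfer distance is the usual choice for this kind of single-image point set generation and is easy to sketch. Whether this matches the exact loss used in the work above is an assumption on my part.

import torch

def chamfer_distance(pred, gt):
    # pred: (batch, N, 3) predicted points, gt: (batch, M, 3) reference points.
    d = torch.cdist(pred, gt)                              # pairwise distances
    # For each predicted point its nearest reference point, and vice versa.
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)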



Date: 11/22/16

Speaker: Jan Wegner 

Title: Large-scale Geospatial Computer Vision: Cities, Point Clouds, Trees


The ever-increasing amount of geocoded images at varying scale, viewpoint, and temporal resolution provides a treasure trove of information for better understanding our environment, helping us make better decisions, manage resources, and improve quality of life, particularly in big cities. Geospatial computer vision combines vision and machine learning techniques that scale, to solve real-world problems. In this talk I will present three ongoing projects:


(a) Cities: Large-scale semantic 3D reconstruction casts 3D modeling and semantic labeling as a joint problem, where semantic image segmentation enforces class-dependent, geometric priors for reconstruction, while 3D benefits semantic labelling via one joint, convex energy formulation. This leads to more accurate 3D city models, that come with category labels directly.


(b) Point Clouds: Unstructured point clouds from multi-view stereo or laser scanners are a major data source for scene analysis in cities. What makes their analysis challenging is the anisotropic point density, self-occlusions, and sheer size of millions of points. We aim at efficient, direct prediction of CAD models from unstructured point clouds through contour extraction. Such contours shall either be completed manually with minimal effort, or be transferred to standard CAD software directly.


(c) Trees: This project, in collaboration with the Caltech Computational Vision Lab, aims to automatically catalogue trees in public space, classify them at the species level, and measure their trunk diameter, to support urban planning as well as ecological studies. We propose an automated, image-based system to build up-to-date tree inventories at large scale using publicly available aerial images, panoramas at street-level, and open GIS data of US cities.



Date: 11/1/16

Speaker: Nikhil Naik

Title: Visual Urban Sensing


Social scientists are extremely interested in understanding (i) the relationship between urban appearance and the behavior and health of urban residents; and (ii) the relationship between urban change and socioeconomic composition, as well as government policies. Thus far, studying these questions has proved challenging due to the fact that researchers need to conduct expensive and time-consuming field surveys to evaluate urban appearance. 

I will introduce two computer vision algorithms that harness Street View imagery to computationally evaluate urban appearance. The first algorithm, Streetscore, quantifies the appearance of a street block from its Street View image by measuring attributes such as perceived safety, wealth, and beauty. The second algorithm computes an "urban change coefficient", which quantifies the growth or decay of a location from time-series Street View images obtained over several years. This approach allows us to generate cross-sectional and longitudinal data on urban appearance at street-block-level resolution and global scale, massively scaling up the size and scope of research in this area.
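
To make the general recipe concrete, here is a hedged sketch of image-feature regression for appearance scores. It is my own illustration, not the Streetscore implementation (which used its own features and crowd-derived training data from pairwise comparisons); the backbone, Ridge regressor, and image_features helper are stand-ins.

# Hypothetical sketch: generic CNN features + ridge regression to a safety score.
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.linear_model import Ridge

backbone = models.resnet18(weights=None)   # pretrained weights would be used in practice
backbone.fc = nn.Identity()                # expose the 512-D pooled features
backbone.eval()

def image_features(batch):                 # batch: (B, 3, 224, 224) tensor
    with torch.no_grad():
        return backbone(batch).numpy()

# Dummy stand-ins for Street View crops and their crowd-derived safety scores.
images = torch.randn(32, 3, 224, 224)
scores = np.random.rand(32)

regressor = Ridge(alpha=1.0).fit(image_features(images), scores)
predicted_safety = regressor.predict(image_features(images[:4]))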




Date: 10/25/16

Speaker: Genevieve Patterson 

Title: COCO Attributes: Attributes for People, Animals and Objects


With the goal of enabling deeper object understanding, we deliver the largest attribute dataset to date. Using our COCO Attributes dataset, a fine-tuned classification system can do more than recognize object categories -- for example, rendering multi-label classifications such as "sleeping spotted curled-up cat" instead of simply "cat". To overcome the expense of annotating thousands of COCO object instances with hundreds of attributes, we present an Economic Labeling Algorithm (ELA) which intelligently generates crowd labeling tasks based on correlations between attributes. The ELA offers a substantial reduction in labeling cost while largely maintaining attribute density and variety. Currently, we have collected 3.5 million object-attribute pair annotations describing 180 thousand different objects. This talk will also discuss unsolved annotation problems and future work on this dataset and related crowd-annotation efforts.



Date: 6/21/16

Speaker: Ronen Basri

Title: Deformation models for image and shape matching


Modeling deformations is important for various applications in computer vision, graphics and geometry processing. In this talk I will describe our recent progress in modeling deformations. In particular, I will describe methods for computing bounded distortion transformations, locally injective maps whose differentials' conformal distortion is bounded. Toward this end, we developed a convex framework for solving optimization problems over matrices that involve functionals and constraints expressed in terms of their extremal singular values. In addition, I will describe methods for computing physically-motivated elastic maps between shapes. We have applied these methods in a number of challenging problems, including feature matching between images related by non-rigid deformation, non-rigid registration of shape models, and computing extremal quasi-conformal maps.



Date: 10/18/16

Speaker: Ayan Sinha

Title: Deep learning 3D shape surfaces using geometry images


Surfaces serve as a natural parametrization of 3D shapes. Learning surfaces using convolutional neural networks (CNNs) is a challenging task. Current paradigms to tackle this challenge are either to adapt the convolutional filters to operate on surfaces, to learn spectral descriptors defined by the Laplace-Beltrami operator, or to drop surfaces altogether in favor of voxelized inputs. I instead adopt an approach of converting the 3D shape into a ‘geometry image’ so that standard CNNs can directly be used to learn 3D shapes. We qualitatively and quantitatively validate that creating geometry images using authalic parametrization on a spherical domain is suitable for robust learning of 3D shape surfaces. This spherically parameterized shape is then projected and cut to convert the original 3D shape into a flat and regular geometry image. I propose a way to implicitly learn the topology and structure of 3D shapes using geometry images encoded with suitable features. I show the efficacy of my approach to learn 3D shape surfaces for classification and retrieval tasks on non-rigid and rigid shape datasets. I also discuss possible extensions of this approach for generative modeling of 3D shapes using current deep architectures for generating images.
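
The key practical point is that once a surface has been resampled into a regular grid, any standard 2D CNN applies. Below is a minimal sketch under assumed shapes (an H x W geometry image whose 3 channels store x, y, z samples, and an arbitrary class count); the GeometryImageClassifier name and architecture are my own, not the speaker's network.

# Hypothetical sketch: a plain 2D CNN classifying geometry images.
import torch
import torch.nn as nn

class GeometryImageClassifier(nn.Module):
    def __init__(self, num_classes=40, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, geometry_image):   # (B, 3, H, W) grid of surface samples
        return self.net(geometry_image)

# Usage with a dummy 64x64 geometry image.
logits = GeometryImageClassifier()(torch.randn(2, 3, 64, 64))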



Date: Tuesday 10/4/16

Speaker: Emma Alexander

Title: Focal Flow: Measuring Distance and Velocity with Defocus and Differential Motion


We present the focal flow sensor. Inspired by the unique visual system of the jumping spider, it is an unactuated, monocular camera that simultaneously exploits defocus and differential motion to measure a depth map and a 3D scene velocity field. It achieves surprising efficiency by using an optical-flow-like, per-pixel linear constraint that relates image derivatives to depth and velocity. We derive this constraint, which is invariant to scene texture and is exactly satisfied only when the sensor's blur kernels are Gaussian. Experiments show useful depth and velocity information for a broader set of aperture configurations, including a simple lens with a pillbox aperture.



Date: 9/27/16

Speaker: Carl Vondrick

Title: Generating Videos with Scene Dynamics


We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos of up to a second at full frame rate that are better than simple baselines, and we show its utility in predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation. This work will appear at NIPS 2016.
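
To illustrate the foreground/background untangling, here is a rough sketch of a two-stream video generator that composites a moving foreground and a static background with a learned mask. The layer sizes, output resolution, and the TwoStreamVideoGenerator name are assumptions of mine, not the exact published architecture.

# Hypothetical sketch: latent code -> (foreground video, mask, static background),
# composited as mask * fg + (1 - mask) * bg.
import torch
import torch.nn as nn

class TwoStreamVideoGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.fg_fc = nn.Linear(z_dim, 256 * 2 * 4 * 4)
        self.fg = nn.Sequential(   # 3D deconvs: (256, 2, 4, 4) -> (4, 16, 32, 32)
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 4, 4, stride=2, padding=1),   # 3 RGB + 1 mask channel
        )
        self.bg_fc = nn.Linear(z_dim, 256 * 4 * 4)
        self.bg = nn.Sequential(   # 2D deconvs: (256, 4, 4) -> (3, 32, 32)
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, z):
        f = self.fg(self.fg_fc(z).view(-1, 256, 2, 4, 4))
        fg, mask = torch.tanh(f[:, :3]), torch.sigmoid(f[:, 3:4])
        bg = self.bg(self.bg_fc(z).view(-1, 256, 4, 4)).unsqueeze(2)  # static over time
        return mask * fg + (1 - mask) * bg      # (B, 3, T=16, 32, 32)

video = TwoStreamVideoGenerator()(torch.randn(2, 100))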



Date: 9/20/16

Speaker: Yedid Hoshen

Title: End-to-End Learning: Applications in Speech, Vision and Cognition


One of the most exciting possibilities opened by deep neural networks is end-to-end learning: the ability to learn tasks without the need for feature engineering or breaking down into sub-tasks. This talk will present three cases illustrating how end-to-end learning can operate in machine perception across the senses (Hearing, Vision) and even for the entire perception-cognition-action cycle.


The talk begins with speech recognition, showing how acoustic models can be learned end-to-end. This approach skips the feature extraction pipeline, carefully designed for speech recognition over decades.


Proceeding to vision, a novel application is described: identification of photographers of wearable video cameras. Such video was previously considered anonymous as it does not show the photographer.


The talk concludes by presenting a new task, encompassing the full perception-cognition-action cycle: visual learning of arithmetic operations using only pictures of numbers. This is done without using or learning the notions of numbers, digits, and operators.




Date: 8/23/16

Speaker: Yoav Schechner 

Title:  Distributed Imaging Networks in Scattering Media


This talk is about distributed multi-view imaging via 3D volumetric scattering and occlusions. This is applicable to atmospheric, underwater and tissue imaging. At a very large scale, we derive ways to recover 3D scatterer distributions in the atmosphere using a new kind of tomography and novel ground-based and spaceborne distributed camera systems. In denser media, artificial lighting is needed. There, we jointly optimize the paths of platforms carrying either a camera or a light source. This generalizes next-best-view and robotic path planning to scattering media and cooperative movable lighting.



Date: 6/14/16

Speaker: David Fouhey

Title: Towards A Physical and Human-Centric Understanding of Images 

One primary goal of AI from its very beginning has been to develop systems that can understand an image in a meaningful way. While we have seen tremendous progress in recent years on naming-style tasks like image classification or object detection, a meaningful understanding requires going beyond this paradigm. Scenes are inherently 3D, so our understanding must also capture the underlying 3D and physical properties. Additionally, our understanding must be human-centric since any man-made scene has been built with humans in mind. Despite the importance of obtaining a 3D and human-centric understanding, we are only beginning to scratch the surface on both fronts: many fundamental questions, in terms of how to both frame and solve the problem, remain unanswered.  

In this talk, I will discuss my efforts towards building a physical and human-centric understanding of images. I will present work addressing the questions: (1) what 3D properties should we model and predict from images, and do we actually need explicit 3D training data to do this? (2) how can we reconcile data-driven learning techniques with the physical constraints that exist in the world? and (3) how can understanding humans improve traditional 3D and object recognition tasks? 



Date: 6/7/16

Speaker: Ali Jahanian

Title: Web Page Gist: What Can You See in One Fixation?


What can users see in a web page at a glance, i.e. in a single fixation? The answer informs our understanding of user experience, as it determines not only what tasks the user can accurately perform within the first few hundred milliseconds, but more broadly the usability of the page. Previous studies of rapid exposure have focused on subjective judgments, e.g. visual appeal. However, we ask directly whether any semantic content can be seen. To this end, we test performance in three objective tasks done at a glance: web page categorization, identification of ads, and localization of a navigational menu. We find that users are well above chance at classifying web pages into one of our ten chosen categories, and find evidence that in doing so they can make use of a small amount of readable text. Users are generally quite good at localizing the menu bar, but less good at identifying ads, perhaps because advertisers aim to make this identification difficult.


Date: 5/24/16

Speaker: Guy Rosman

Title: Persistent and Adaptive Robotic Perception


We envision a world of pervasive robots, with robots operating for long periods of time and adapting to many different tasks and environments. The long duration of operation and the wide variety of tasks require us to summarize the robot's history in an efficient manner.


Consider for example a robot that collects a stream of data from cameras for a week and has to answer the instantaneous question “Have I been here before?” It is not possible to answer this question on the required time scale by looking linearly through all the data. How can we develop intermediate data representations and algorithms that give robots the option to use historical data in real-time decision making? I will describe a new coreset algorithm for compressing vision data and its applications to search and retrieval.  


Another current limitation of many robots is that their sensors operate in a fixed manner, regardless of the task or world state estimate. In the second part of the talk I will discuss a new approach for information-driven 3D sensing with robots, using adaptive time-multiplexed structured light. I will show how to optimally choose projector patterns for these scanners based on focused information maximization. This makes it possible to adapt 3D scanning according to prior scene uncertainty and the task of interest. I demonstrate the approach on range estimation and localization.


Date: 5/10/16

Speaker: Guy Satat

Title: Computational Imaging Through Scattering


Imaging through scattering media has long been a challenge, as scattering corrupts scenes in a non-invertible way. Using visible wavelengths to image through scattering media can enable broad applications in bio-medical and industrial imaging, as it provides many advantages, such as optical contrast, non-ionizing radiation, and the availability of fluorescent tags. In this talk I'll discuss a computational imaging technique that overcomes, and even exploits, scattering in order to recover scene parameters. Specifically, the method is able to recover the location of fluorescent markers hidden behind a turbid layer and classify them based on fluorescence lifetime analysis. This approach has applications in in-vivo fluorescence lifetime imaging.


Date: 5/03/16

Speaker: Georgia Gkioxari

Title: Contextual Recognition using Convolutional Neural Networks


Objects exhibit organizational structure in real-world settings, as suggested by work in psychophysics (Biederman et al, 1982). Can we leverage the organizational nature of objects in scenes in order to build strong recognition engines? In this talk, I present recent work which demonstrates how contextual reasoning can improve the detection accuracy of highly contextual objects, as found in the MS COCO dataset. I then turn to the most interesting “object” of all, the person, and focus on action recognition, attribute classification, and pose estimation from images and videos containing people. For action and attribute classification, I demonstrate state-of-the-art results with a model which learns to extract contextual cues in both an instance-specific and category-specific manner, succeeding in capturing the areas of interest without hand-crafting parts or scene features. Finally, exploiting the nature of structured tasks, I show how the contextual signal derived from the ordering of the sub-tasks can lead to very competitive results for the task of pose estimation from images and video.



Date: 4/26/16

Speaker: Al Hero 

Title : Graph continuum limits and applications 


Many problems in data science fields including data mining, computer vision, and machine learning involve combinatorial optimization over graphs, e.g., minimal spanning trees, traveling salesman tours, or k-point minimal graphs over a feature space. Certain properties of minimal graphs, like their length, minimal paths, or span, have continuum limits as the number of nodes approaches infinity. Such limits arise in problems including spectral clustering, statistical classification, multi-objective learning, and anomaly detection. In some cases these continuum limits lead to analytical approximations that can break the combinatorial bottleneck. I will present an overview of some of the theory of graph continuum limits and illustrate it with applications to anomaly detection, computational imaging, and data mining.
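
As a toy illustration of one such continuum limit (my own example, not from the talk): the length of the Euclidean minimal spanning tree over n uniform points in the unit d-cube grows like n^((d-1)/d), so the normalized length approaches a constant as n grows.

# Hypothetical sketch: empirically observe the MST length scaling law.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_length(points):
    dists = squareform(pdist(points))          # dense pairwise distance matrix
    tree = minimum_spanning_tree(dists)        # sparse MST over the complete graph
    return tree.sum()

rng = np.random.default_rng(0)
d = 2
for n in [200, 800, 3200]:
    pts = rng.random((n, d))
    # The ratio should stabilize toward a constant as n increases.
    print(n, mst_length(pts) / n ** ((d - 1) / d))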


Date: 4/19/16

Speaker: Yibiao Zhao

Title : A Quest for Visual Commonsense: Scene Understanding by Functional and Physical Reasoning

Computer vision has made significant progress in locating and recognizing objects in recent decades. However, it still lacks the ability to understand scenes in the way that characterizes human visual experience. Compared with human vision, what is missing in current computer vision? One answer is that human vision is not only about pattern recognition; it also supports a rich set of commonsense reasoning about object function, scene physics, social intentions, etc.


I build systems for real-world applications while simultaneously pursuing a long-term goal: devising a unified framework that can make sense of an image and a scene by reasoning about the functional and physical mechanisms of objects in a 3D world. By bridging advances spanning stochastic learning, computer vision, and cognitive science, my research tackles the following challenges:


(i) What is the visual representation? I develop stochastic grammar models to characterize the spatiotemporal structures of visual scenes and events. The analogy to human natural language lays a foundation for representing both visual structure and abstract knowledge.


(ii) How do we reason about commonsense knowledge? I augment the grammatical representation with commonsense knowledge about functionality and physical stability. Bottom-up and top-down inference algorithms are designed to find the most plausible interpretation of the visual stimuli.


(iii) How do we acquire commonsense knowledge? I performed three case studies to acquire different kinds of commonsense knowledge: teaching the computer to learn affordances from observing human actions, to learn tool use from a single one-shot demonstration, and to infer containment relations by physical simulation without an explicit training process.


Such a sophisticated understanding of 3D scenes enables computer vision to reason about, predict, and interact with the 3D environment, as well as to hold intelligent dialogues beyond the visible spectrum.



Date: 4/12/16

Speaker: Yu Xiang (visiting from Stanford University)

Title: 3D Object Representations for Recognition


Object recognition from images is a longstanding and challenging problem in computer vision. The main challenge is that the appearance of objects in images is affected by a number of factors, such as illumination, scale, camera viewpoint, intra-class variability, occlusion, truncation, and so on. How to handle all these factors in object recognition is still an open problem. In this talk, I present my efforts in building 3D object representations for object recognition. Compared to 2D appearance based object representations, 3D object representations can capture the 3D nature of objects and better handle viewpoint variation, occlusion and truncation in object recognition. I will also talk about our work on building benchmark datasets for 3D object recognition.



Date: 4/5/16

Speaker: Maros Blaha 

Title: Large-Scale Semantic 3D Reconstruction: an Adaptive Multi-Resolution Model for Multi-Class Volumetric Labeling


We propose an adaptive multi-resolution formulation of semantic 3D reconstruction. Given a set of images of a scene, semantic 3D reconstruction aims to densely reconstruct both the 3D shape of the scene and a segmentation into semantic object classes. Jointly reasoning about shape and class allows one to take into account class-specific shape priors, leading to improved reconstruction results. So far, semantic 3D reconstruction methods have been limited to small scenes and low resolution, because of their large memory footprint and computational cost. To scale them up to large scenes, we propose a hierarchical scheme which refines the reconstruction only in regions that are likely to contain a surface, exploiting the fact that both high spatial resolution and high numerical precision are only required in those regions. Our scheme amounts to solving a sequence of convex optimizations while progressively removing constraints, in such a way that the energy, in each iteration, is the tightest possible approximation of the underlying energy at full resolution. In our experiments the method saves up to 98% memory and 95% computation time, without any loss of accuracy.


Date: 3/29/16

Speaker: Jeroen Chua 

Title: Stochastic scene grammars


The primary motivation of this work is that contextual information provides important cues for object detection and localization. It is also useful in other tasks, such as contour detection and image segmentation.


We present a generic grammar-based framework in which the ability to use contextual information is a key feature. For inference, we convert our grammar-based model to a factor graph and run loopy belief propagation (LBP). We also define grammar transformations to perform LBP in time sub-linear in the number of edges of the original factor graph. Our approach is applicable to a wide range of vision tasks, and we test the framework on contour detection and object detection.
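
For readers unfamiliar with the inference step, here is a generic sum-product loopy BP sketch for a pairwise model. It is my own illustration of plain LBP only; the grammar-to-factor-graph conversion and the sub-linear grammar transformations from the talk are not shown, and loopy_bp is a hypothetical helper name.

# Hypothetical sketch: sum-product loopy belief propagation on a pairwise model.
import numpy as np

def loopy_bp(unary, edges, pairwise, iters=20):
    """unary: {node: (S_i,) array}; edges: list of (i, j) pairs;
       pairwise: {(i, j): (S_i, S_j) array}. Returns approximate marginals."""
    msgs = {(i, j): np.ones(len(unary[j]))
            for (a, b) in edges for (i, j) in [(a, b), (b, a)]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            # product of the unary at i and all incoming messages except from j
            incoming = unary[i].copy()
            for (k, l) in msgs:
                if l == i and k != j:
                    incoming = incoming * msgs[(k, l)]
            m = psi.T @ incoming              # marginalize over x_i
            new[(i, j)] = m / m.sum()
        msgs = new
    marginals = {}
    for i in unary:
        b = unary[i].copy()
        for (k, l) in msgs:
            if l == i:
                b = b * msgs[(k, l)]
        marginals[i] = b / b.sum()
    return marginals

# Tiny usage example: a 3-node cycle with binary variables.
unary = {0: np.array([0.6, 0.4]), 1: np.array([0.5, 0.5]), 2: np.array([0.3, 0.7])}
edges = [(0, 1), (1, 2), (2, 0)]
pairwise = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}
print(loopy_bp(unary, edges, pairwise))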



Date: 3/22/16

Speaker: Achuta Kadambi

Title: Computational Imaging at Human-scale, for 3D Imaging and Beyond


Computational imaging is a rapidly growing research topic. A key aim is to jointly design optical capture and post-processing algorithms to rethink the imaging system. Historically, prior art clusters around two extremes: small scale (e.g. microscopy) or very large scale (e.g. astronomy). Recently, a growing number of technologies are able to operate at "human-scale", i.e., the meter-sized scenes humans are in contact with. In particular, I will discuss recent work on computational 3D cameras that acquire high-quality 3D shape using the polarization of light (an ICCV 2015 paper).



Date: 3/1/16

Speaker: Matthew Johnson 

Title: Structured VAEs: combining probabilistic graphical models and variational autoencoders


I'll talk about a new way to compose probabilistic graphical models with deep learning methods and combine their respective strengths. The method uses graphical models to express structured probability distributions and recent advances from deep learning to learn flexible feature models and bottom-up recognition networks. All components of these models are learned simultaneously using a single mean field objective, and I'll develop scalable fitting algorithms that can leverage natural gradient stochastic variational inference, graphical model message passing, and backpropagation with the reparameterization trick. I'll motivate these methods with an application to mouse behavioral phenotyping using Kinect depth video data.




Date: 2/23/16

Speaker: Nikhil Naik

Title: Mitigating Multipath Interference in Time-of-Flight Sensors using Light Transport Information


Continuous-wave Time-of-flight (TOF) range imaging has become a commercially viable technology with many applications in computer vision and graphics. However, the depth images obtained from TOF cameras contain scene dependent errors due to multipath interference (MPI). Specifically, MPI occurs when multiple optical reflections return to a single spatial location on the imaging sensor. Many prior approaches to rectifying MPI rely on sparsity in optical reflections, which is an extreme simplification. In this work, we correct MPI by combining the standard measurements from a TOF camera with information from direct and global light transport. We report results on both simulated experiments and physical experiments (using the Kinect sensor). Our results, evaluated against ground truth, demonstrate a quantitative improvement in depth accuracy.
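
For background, the standard continuous-wave TOF model (textbook material, not this work's contribution) makes the MPI problem concrete. A single return from depth $d$ arrives with phase shift

\[
\varphi = \frac{4\pi f_{\mathrm{mod}}\, d}{c},
\qquad\text{so}\qquad
d = \frac{c\,\varphi}{4\pi f_{\mathrm{mod}}},
\]

but when several optical paths return to the same pixel the sensor effectively measures a sum of phasors,

\[
\hat{m} = \sum_i a_i\, e^{j\varphi_i},
\]

whose phase, and hence the recovered depth, is biased away from the direct-path value; this is the scene-dependent error the talk addresses.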



Date: 2/16/16

Speaker: Randi Cabezas

Title: Semantically-Aware Aerial Reconstruction from Multi-Modal Data


We consider a methodology for integrating multiple sensors along with semantic information to enhance scene representations. We propose a probabilistic generative model for inferring semantically-informed aerial reconstructions from multi-modal data within a consistent mathematical framework. The approach, called Semantically-Aware Aerial Reconstruction (SAAR), not only exploits inferred scene geometry, appearance, and semantic observations to obtain a meaningful categorization of the data, but also extends previously proposed methods by imposing structure on the prior over geometry, appearance, and semantic labels. This leads to more accurate reconstructions and the ability to fill in missing contextual labels via joint sensor and semantic information. We introduce a new multi-modal synthetic dataset in order to provide quantitative performance analysis. Additionally, we apply the model to real-world data and exploit OpenStreetMap as a source of semantic observations. We show quantitative improvements in reconstruction accuracy of large-scale urban scenes from the combination of LiDAR, aerial photography, and semantic data. Furthermore, we demonstrate the model’s ability to fill in for missing sensed data, leading to more interpretable reconstructions.



Date: 12/8/15

Speaker: Tim Ragan, founder of TissueVision

Title: TissueVision


TissueVision specializes in high-throughput tissue imaging services and hardware for the pharmaceutical, biotech, and academic communities. Its services include whole-organ imaging, brain atlasing, and software analyses. The imaging technology is currently being used by its clients to answer questions in cancer, toxicology, neuroscience, and other areas. Tim is coming to give a high-level talk about his company and the types of problems TissueVision solves using vision algorithms.



Date: 11/30/2015

Speaker: Aditya Khosla

Title: Building a Semantic Understanding of the Internal Representation of CNNs


The recent success of convolutional neural networks (CNNs) on object recognition has led to CNNs becoming the state-of-the-art approach for a variety of tasks in computer vision. This has led to a plethora of recent works that analyze the internal representation of CNNs in an attempt to unlock the secret to their remarkable performance and to provide a means of further improving it by understanding its shortcomings. While some works suggest that CNNs learn a distributed code for objects, others suggest that they learn a more semantically interpretable representation consisting of various components such as colors, textures, objects, and scenes.


In this talk, I explore methods to deepen the semantic understanding of the internal representation of CNNs and propose methods for improving it. Unlike prior work that relies on the manual annotation of each neuron, we propose an approach that uses existing annotation from a variety of datasets to automatically understand the semantics of the firings of each neuron. Specifically, we classify each neuron as detecting a color, texture, shape, object part, object, or scene, and apply this to automatically parse images at various levels in a single forward pass of a CNN. We find that despite the availability of ground truth annotation from various datasets, the task of identifying exactly what a unit is doing turns out to be rather challenging. As such, we introduce a visualization benchmark containing the annotations of the internal units of popular CNN models, allowing further research to be conducted in a more structured setting.


We demonstrate that our approach performs well on this benchmark and can be applied to answering a number of questions related to CNNs: How do the semantics of neurons evolve during training? Do they latch on to specific concepts and stick to them, or do they fluctuate? Do the semantics learned by a network differ when training from scratch or fine-tuning? How does the representation change if the image set is the same but the label space changes?


Date: 11/10/15

Speaker: Larry Zitnick

Title: The Depth of Our Understanding: Vision, Language, and Common Sense

How deeply do machines understand our world? Recent advances in artificial intelligence can give the impression that machines have achieved a surprising level of understanding. However, if examined more closely we can gain insight into the limitations of current approaches and what problems still remain. A case study using the image captioning task is provided. We explore how results may be misinterpreted, and how difficulties in task evaluation can cloud our judgement of progress. We conclude by discussing new methods for learning based on visual abstraction, and new tasks for evaluating artificial intelligence using visual question answering.


Date: 10/27/15

Speaker: Wenzhen Yuan

Title: Measurement of Force and Slip with a GelSight Tactile Sensor


GelSight is an optics-based tactile sensor that applies machine vision techniques to the sense of touch. It uses a piece of clear elastomer as the contact medium and applies photometric stereo to calculate the surface normals of the deformed elastomer. It is the first tactile sensor to capture very high-resolution topography of the contact surface, and this rich information lets us learn more about touch perception. Our major focus is using GelSight with robots to explore and interact with the environment through touch sensing. Force and slip are also important signals in this process, and we found that the planar deformation of the elastomer is a good indicator of force and slip during physical contact. We track the elastomer's planar deformation by adding markers to its surface and studying their displacement fields under different load conditions. In this talk, I will present our progress in measuring force and slip from the marker displacement fields. Experiments have shown that the method helps robots learn to interact with the environment.
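
Since the abstract mentions photometric stereo, here is a minimal sketch of the classical Lambertian version: given images of the same surface under K known directional lights, solve a per-pixel least-squares problem for albedo-scaled normals. This is a generic textbook method, not GelSight's calibrated pipeline, and photometric_stereo is my own helper name.

# Hypothetical sketch: classical Lambertian photometric stereo.
import numpy as np

def photometric_stereo(images, light_dirs):
    """images: (K, H, W) grayscale stack; light_dirs: (K, 3) unit light vectors."""
    K, H, W = images.shape
    I = images.reshape(K, -1)                               # (K, H*W)
    # Solve L @ G = I in the least-squares sense, where G = albedo * normal.
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)      # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.reshape(3, H, W), albedo.reshape(H, W)

# Usage with dummy data (3 lights, 64x64 images).
lights = np.array([[0.0, 0.0, 1.0], [0.5, 0.0, 0.87], [0.0, 0.5, 0.87]])
normals, albedo = photometric_stereo(np.random.rand(3, 64, 64), lights)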



Date: 10/20/15

Speaker: Sylvain Paris

Title: A Selection of Papers that I Like


I will go through a few selected papers related to photo and video editing that inspire my work. We will talk about image pyramids, alpha matting, tone mapping, and other interesting topics. I will share my perspective on these papers and what I like about them, and leave some time so that everyone can contribute to the discussion. This presentation is a first version of a tutorial that I will give to grad students in France in November; feedback and suggestions are welcome.



Date: 10/13/15

Speaker: Carl Vondrick

Title: Anticipating the future by watching unlabeled video


In many computer vision applications, machines will need to reason beyond the present and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently obtaining this knowledge is readily available unlabeled video. In this work, we present a large-scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate actions and objects in the future. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. We experimentally evaluate this idea on two challenging “in the wild” video datasets, and our results suggest that learning with unlabeled videos can help forecast actions and anticipate objects.
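
The core idea, predicting the future visual representation rather than future pixels, admits a compact sketch: regress the deep features of a future frame from the current frame, with a fixed network providing the target. The AlexNet-style sizes and the loss_fn helper below are my assumptions, not necessarily the authors' exact setup.

# Hypothetical sketch: regress future deep features from the present frame.
import torch
import torch.nn as nn
import torchvision.models as models

target_net = models.alexnet(weights=None)     # pretrained weights would be loaded in practice
feature_extractor = nn.Sequential(
    target_net.features, target_net.avgpool, nn.Flatten(),
    *list(target_net.classifier.children())[:-1],   # drop the final class layer -> 4096-D
)
feature_extractor.eval()
for p in feature_extractor.parameters():
    p.requires_grad = False

predictor = models.alexnet(weights=None)
predictor.classifier[-1] = nn.Linear(4096, 4096)   # emit a predicted 4096-D future feature

def loss_fn(current_frame, future_frame):
    with torch.no_grad():
        target = feature_extractor(future_frame)   # representation of the future frame
    pred = predictor(current_frame)                # predicted from the present frame
    return nn.functional.mse_loss(pred, target)

# Usage on a dummy pair of frames.
loss = loss_fn(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
loss.backward()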


Date: 9/29/15

Speaker: Tali Dekel

Title: Exploring Spatial Variations in a Single Image


Structures and objects, captured in image data, are often “idealized” by the viewer. For example, buildings may seem to be perfectly straight, or repeating structures such as corn kernels may seem almost identical. However, in reality, such flawless behavior hardly exists. In this talk I will address a new problem -- analyzing and visualizing “imperfection”, i.e., the departure of objects from their idealized models, given only a single image as input. I will consider this problem under two distinct definitions of the idealized model, leading to two new algorithms with applications in various domains including civil engineering, astronomy, and materials defect inspection.



Date: 9/22/15

Speaker: Julian Straub

Title: Directional Environment Perception


From a single viewpoint of man-made structures to large-scale urban environments, surface normal distributions exhibit similar characteristic patterns. This research seeks to characterize and utilize those patterns to further scene understanding and environment perception for autonomous agents. Specifically, it investigates connections between local and global structure, as well as ways of inferring global structure from local surface normal observations. Autonomous systems are faced with continuous streams of data. To address this, inference is based on nonparametric Bayesian models, which allow model complexity to flexibly adapt to the data distribution. Additionally, the proposed models and inference algorithms properly utilize the geometry of the unit sphere, the space of directional data such as surface normals. In recognition of the time constraints faced by continuously sensing agents, small-variance analysis is employed to derive real-time-capable algorithms from Bayesian nonparametric models.
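
As a toy stand-in for the directional models described above (not the Bayesian nonparametric machinery from the talk), clustering unit surface normals on the sphere can be sketched with spherical k-means, which respects the unit-sphere geometry by using cosine similarity and renormalized means; spherical_kmeans is my own helper name.

# Hypothetical sketch: spherical k-means over unit surface normals.
import numpy as np

def spherical_kmeans(normals, k, iters=50, seed=0):
    """normals: (N, 3) array of unit vectors. Returns (k, 3) unit cluster directions."""
    rng = np.random.default_rng(seed)
    centers = normals[rng.choice(len(normals), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(normals @ centers.T, axis=1)   # nearest center by cosine similarity
        for j in range(k):
            members = normals[assign == j]
            if len(members):
                mean = members.sum(axis=0)
                centers[j] = mean / np.linalg.norm(mean)  # project mean back onto the sphere
    return centers

# Usage: cluster random unit normals into 6 dominant directions.
v = np.random.randn(1000, 3)
directions = spherical_kmeans(v / np.linalg.norm(v, axis=1, keepdims=True), k=6)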