Authors: Tali Dekel, Erika Lu, Forrester Cole, Weidi Xie, Miki Rubinstein, David Salesin, Andrew Zisserman and Bill Freeman
Abstract:
By speeding up, slowing down, or synchronizing people’s movements, we can change our perception of an event recorded in a video. In films, such manipulation of time is widely used for dramatizing or de-emphasizing certain actions or events, but involves a significant amount of manual work and professional equipment. In this talk, I’ll present “Layered Neural Rendering for Retiming People” (SIGGRAPH Asia’20): a recent method that brings retiming effects into the realm of everyday videos and achieves them computationally in post-processing. The pillar of the technique is a novel learning-based decomposition of the input video into human-specific RGBA (color + opacity) layers. The key property of our decomposition is that each layer not only represents a person in the video but also all the scene elements that are related to them, such as shadows, reflections, and loose clothing. Our layered neural renderer learns this decomposition just by observing the input video, without requiring any manual labels. I’ll show various retiming effects of people dancing, groups running, and kids jumping on trampolines.
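As a rough illustration only (not the paper's neural renderer), the sketch below shows how per-person RGBA layers, once estimated, could be retimed independently and then composited back-to-front with standard alpha blending; the layer arrays and per-person speedup factors are toy assumptions.

```python
# Minimal sketch: retime each person's RGBA layer stream independently and
# composite the layers back-to-front with standard alpha blending.
import numpy as np

def composite(layers):
    """layers: list of (H, W, 4) float arrays in [0, 1], ordered back-to-front."""
    out = np.zeros_like(layers[0][..., :3])
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out
    return out

def retime(video_layers, speedups, t):
    """video_layers[p][f] is the RGBA layer of person p at frame f.
    speedups[p] is a per-person playback-rate factor (e.g. 2.0 = twice as fast)."""
    frames = []
    for p, person in enumerate(video_layers):
        f = min(int(round(t * speedups[p])), len(person) - 1)  # remapped frame index
        frames.append(person[f])
    return composite(frames)

# Toy usage: random layers for two "people" over 30 frames.
rng = np.random.default_rng(0)
video_layers = [[rng.random((64, 64, 4)) for _ in range(30)] for _ in range(2)]
frame_10 = retime(video_layers, speedups=[1.0, 2.0], t=10)
print(frame_10.shape)  # (64, 64, 3)
```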
Authors: Irit Chelly (BGU), Vlad Winter (BGU), Dor Litvak (BGU), Oren Freifeld (BGU), David Rosen (MIT).
Abstract:
Background models are widely used in computer vision. While successful Static-camera Background (SCB) models exist, Moving-camera Background (MCB) models are limited. Seemingly, there is a straightforward solution: 1) align the video frames; 2) learn an SCB model; 3) warp either original or previously-unseen frames toward the model. This approach, however, has drawbacks, especially when the accumulative camera motion is large and/or the video is long. Here we propose a purely-2D unsupervised modular method that systematically eliminates those issues. First, to estimate warps in the original video, we solve a joint-alignment problem while leveraging a certifiably-correct initialization. Next, we learn both multiple partially-overlapping local subspaces and how to predict alignments. Lastly, at test time, we warp a previously-unseen frame, based on the prediction, and project it on a subset of those subspaces to obtain a background/foreground separation. We show that the method handles even large scenes with a relatively-free camera motion (provided the camera-to-scene distance does not change much) and that it not only yields state-of-the-art results on the original video but also generalizes gracefully to previously-unseen videos of the same scene. The talk is based on [Chelly et al., CVPR'20].
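The joint alignment and the certifiably-correct initialization are the paper's contributions and are not reproduced here; as a simplified stand-in for the subspace step, the sketch below learns a single PCA background subspace from already-aligned frames and separates a new frame into its projection (background) and residual (foreground).

```python
# Illustrative sketch only (assumes frames are already aligned to a common
# coordinate system): learn a low-rank background subspace with PCA and
# separate a new frame into background (projection) and foreground (residual).
import numpy as np

def learn_subspace(aligned_frames, rank=10):
    """aligned_frames: (N, D) matrix, each row a flattened aligned frame."""
    mean = aligned_frames.mean(axis=0)
    U, S, Vt = np.linalg.svd(aligned_frames - mean, full_matrices=False)
    return mean, Vt[:rank]            # top-'rank' principal directions

def separate(frame, mean, basis):
    """Project a flattened, aligned frame onto the background subspace."""
    coeffs = basis @ (frame - mean)
    background = mean + basis.T @ coeffs
    foreground = frame - background   # residual = moving objects + noise
    return background, foreground

rng = np.random.default_rng(0)
train = rng.random((100, 32 * 32))    # 100 aligned frames (toy data)
mean, basis = learn_subspace(train, rank=5)
bg, fg = separate(train[0], mean, basis)
print(bg.shape, np.abs(fg).mean())
```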
Authors: Chen Bar (Technion), Ioannis Gkioulekas (CMU), Anat Levin (Technion)
Abstract:
We introduce rendering algorithms for the simulation of speckle statistics observed in scattering media under coherent near-field imaging conditions. Our work is motivated by the recent proliferation of techniques that use speckle correlations for tissue imaging applications: The ability to simulate the image measurements used by these speckle imaging techniques in a physically-accurate and computationally-efficient way can facilitate the widespread adoption and improvement of these techniques. To this end, we draw inspiration from recently-introduced Monte Carlo algorithms for rendering speckle statistics under far-field conditions (collimated sensor and illumination). We derive variants of these algorithms that are better suited to the near-field conditions (focused sensor and illumination) required by tissue imaging applications. Our approach is based on using Gaussian apodization to approximate the sensor and illumination aperture, as well as von Mises-Fisher functions to approximate the phase function of the scattering material. We show that these approximations allow us to derive closed-form expressions for the focusing operations involved in simulating near-field speckle patterns. As we demonstrate in our experiments, these approximations accelerate speckle rendering simulations by a few orders of magnitude compared to previous techniques, at the cost of negligible bias. We validate the accuracy of our algorithms by reproducing ground truth speckle statistics simulated using wave-optics solvers, and real-material measurements available in the literature. Finally, we use our algorithms to simulate biomedical imaging techniques for focusing through tissue.
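As a small sketch of one ingredient mentioned above (and only that ingredient), the snippet below evaluates a 3D von Mises-Fisher density on the unit sphere, the kind of smooth, forward-peaked function used to approximate a scattering phase function; the mean direction and concentration values are illustrative assumptions.

```python
# Von Mises-Fisher (vMF) density on the unit sphere, usable as a smooth
# approximation to a forward-peaked scattering phase function.
# 3D normalization: C(kappa) = kappa / (4 * pi * sinh(kappa)).
import numpy as np

def vmf_pdf(directions, mu, kappa):
    """directions: (N, 3) unit vectors; mu: (3,) mean direction; kappa > 0.
    Note: for very large kappa a numerically stable form would be needed."""
    mu = mu / np.linalg.norm(mu)
    c = kappa / (4.0 * np.pi * np.sinh(kappa))
    return c * np.exp(kappa * directions @ mu)

# Evaluate the density for scattering angles 0..180 degrees around mu = +z.
theta = np.linspace(0.0, np.pi, 5)
dirs = np.stack([np.sin(theta), np.zeros_like(theta), np.cos(theta)], axis=1)
print(vmf_pdf(dirs, mu=np.array([0.0, 0.0, 1.0]), kappa=10.0))
```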
Authors: Or Patashnik, Dov Danon, Hao Zhang, Daniel Cohen-Or
Abstract:
State-of-the-art image-to-image translation methods tend to struggle in an imbalanced domain setting, where one image domain lacks richness and diversity. We introduce a new unsupervised translation network, BalaGAN, specifically designed to tackle the domain imbalance problem. We leverage the latent modalities of the richer domain to turn the image-to-image translation problem, between two imbalanced domains, into a balanced, multi-class, and conditional translation problem, more resembling the style transfer setting. Specifically, we analyze the source domain and learn a decomposition of it into a set of latent modes or classes, without any supervision. This leaves us with a multitude of balanced cross-domain translation tasks, between all pairs of classes, including the target domain. During inference, the trained network takes as input a source image, as well as a reference or style image from one of the modes as a condition, and produces an image which resembles the source on the pixel-wise level, but shares the same mode as the reference. We show that employing modalities within the dataset improves the quality of the translated images, and that BalaGAN outperforms strong baselines of both unconditioned and style-transfer-based image-to-image translation methods, in terms of image quality and diversity.
Authors: Eitan Richardson, Yair Weiss
Abstract:
Unsupervised image-to-image translation is an inherently ill-posed problem. Recent methods based on deep encoder-decoder architectures have shown impressive results, but we show that they only succeed due to a strong locality bias, and they fail to learn very simple nonlocal transformations (e.g. mapping upside down faces to upright faces). When the locality bias is removed, the methods are too powerful and may fail to learn simple local transformations. In this work we introduce linear encoder-decoder architectures for unsupervised image-to-image translation. We show that learning is much easier and faster with these architectures and yet the results are surprisingly effective. In particular, we show a number of local problems for which the results of the linear methods are comparable to those of state-of-the-art architectures but with a fraction of the training time, and a number of nonlocal problems for which the state-of-the-art fails while linear methods succeed. We will also discuss work in progress about how to extend this approach to the nonlinear case.
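To convey the spirit of a purely linear translator, the toy sketch below fits a linear map between two flattened image domains in closed form. It deliberately simplifies: it uses paired toy data and ordinary least squares, whereas the work above addresses the unpaired (unsupervised) setting; the flip transform is just an example of a nonlocal mapping.

```python
# Toy illustration of a purely linear "encoder-decoder" translation:
# fit a linear map W between two domains in closed form.
import numpy as np

rng = np.random.default_rng(0)
D = 16 * 16                                  # flattened toy "image" size
source = rng.random((500, D))                # domain A samples
flip = np.eye(D)[::-1]                       # a simple nonlocal transform: flip
target = source @ flip + 0.01 * rng.standard_normal((500, D))

# Closed-form linear translator (with bias) via least squares.
X = np.hstack([source, np.ones((500, 1))])
W, *_ = np.linalg.lstsq(X, target, rcond=None)

translated = np.hstack([source[:5], np.ones((5, 1))]) @ W
print(np.abs(translated - target[:5]).mean())   # small reconstruction error
```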
Authors: Zongze Wu, Dani Lischinski
Abstract:
In this paper we explore and analyze the latent style space of StyleGAN2, a state-of-the-art architecture for image generation, using models pretrained on several different datasets. Our analysis discovers a large collection of style dimensions, each of which is shown to control a distinct visual attribute in the images generated by the model. These controls are highly localized and are surprisingly well disentangled, enabling manipulation of different visual attributes even when those occupy the same image region. Furthermore, we discover that many semantic concepts are affected by only a few style dimensions, rather than by the entire (intermediate) latent vector. These properties of the StyleGAN2 style space are well aligned with modern disentanglement theory which postulates that an ideal disentangled representation corresponds to a one-to-one mapping between latent dimensions and factors of variation in the data. We demonstrate that our findings pave the way to semantically meaningful, localized, and disentangled image manipulations via a simple and intuitive interface.
Authors: Elad Hirsch, Ayellet Tal
Abstract:
Visual illusions may be explained by the likelihood of patches in real-world images, as argued by input-driven paradigms in neuroscience. However, neither the data nor the tools existed in the past to extensively support these explanations. The era of big data opens a new opportunity to study input-driven approaches. We introduce a tool that computes the likelihood of patches, given a large dataset to learn from. Given this tool, we present a model that supports the approach and explains lightness and color visual illusions in a unified manner. Furthermore, our model generates visual illusions in natural images, by applying the same tool in reverse.
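As a simplified stand-in for a patch-likelihood tool of this kind, the sketch below fits a Gaussian mixture over small image patches and scores new patches by their log-likelihood; the toy random "images" and the mixture size are assumptions, and the actual model and training data are far richer.

```python
# Simplified patch-likelihood tool: fit a Gaussian mixture over small image
# patches and score how likely a new patch is under that model.
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_patches(image, size=5):
    h, w = image.shape
    patches = [image[i:i + size, j:j + size].ravel()
               for i in range(0, h - size, size)
               for j in range(0, w - size, size)]
    return np.array(patches)

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))                      # toy "natural" images
train = np.vstack([extract_patches(im) for im in images])

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(train)

query = extract_patches(rng.random((64, 64)))
print(gmm.score_samples(query)[:5])                    # per-patch log-likelihoods
```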
Authors: Yuval Nirkin (BIU), Yosi Keller (BIU), Tal Hassner (FB) and Lior Wolf (FB and TAU)
Abstract:
We propose a method for detecting face swapping and other identity manipulations in single images. Face swapping methods, such as DeepFake, manipulate the face region, aiming to adjust the face to the appearance of its context, while leaving the context unchanged. We show that this modus operandi produces discrepancies between the two regions. These discrepancies offer exploitable telltale signs of manipulation. Our approach involves two networks: (i) a face identification network that considers the face region bounded by a tight semantic segmentation, and (ii) a context recognition network that considers the face context (e.g., hair, ears, neck). We describe a method that uses the recognition signals from our two networks to detect such discrepancies, providing a complementary detection signal that improves conventional real vs. fake classifiers commonly used for detecting fake images. Our method achieves state-of-the-art results on the FaceForensics++, Celeb-DF-v2, and DFDC benchmarks for face manipulation detection, and even generalizes to detect fakes produced by unseen methods.
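The sketch below illustrates only the scoring intuition, not the paper's method: given identity embeddings produced upstream by a face-region network and a context network (both assumed, not reproduced here), their disagreement is used as a manipulation cue.

```python
# Illustrative scoring step: larger face/context embedding disagreement
# suggests a possible identity manipulation.
import numpy as np

def discrepancy_score(face_embedding, context_embedding):
    """Returns a score in [0, 2]; higher = larger face/context disagreement."""
    f = face_embedding / np.linalg.norm(face_embedding)
    c = context_embedding / np.linalg.norm(context_embedding)
    return 1.0 - float(f @ c)        # 1 - cosine similarity

rng = np.random.default_rng(0)
face = rng.standard_normal(256)                          # toy face embedding
context_same_id = face + 0.1 * rng.standard_normal(256)  # embeddings agree
context_other_id = rng.standard_normal(256)              # embeddings disagree

print(discrepancy_score(face, context_same_id))   # near 0
print(discrepancy_score(face, context_other_id))  # near 1
```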
Authors: Yael Konforti, Alon Shpigler, Boaz Lerner, Aharon Bar-Hillel
Abstract:
Convolutional neural networks (CNNs) have achieved superior accuracy in many visual related tasks. However, the inference process through intermediate layers is opaque, making it difficult to interpret such networks or develop trust in their operation. We propose to model the network hidden layers activity using probabilistic models. The activity patterns in layers of interest are modeled as Gaussian mixture models, and transition probabilities between clusters in consecutive modeled layers are estimated. Based on maximum-likelihood considerations, nodes and paths relevant for network prediction are chosen, connected, and visualized as an inference graph. We show that such graphs are useful for understanding the general inference process of a class, as well as explaining decisions the network makes regarding specific images. In addition, the models provide an interesting observation regarding the highly local nature of column activities in top CNN layers.
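The following is a rough sketch of the two modeling steps described above, on toy activation data: fit a Gaussian mixture to the activation vectors of each modeled layer, then estimate transition probabilities between clusters of consecutive layers from co-occurring cluster assignments; the array shapes and component counts are illustrative assumptions.

```python
# (i) Fit a GMM to each modeled layer's activations; (ii) estimate transition
# probabilities between clusters of consecutive layers.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
acts_layer1 = rng.standard_normal((1000, 64))    # toy activations, layer L
acts_layer2 = rng.standard_normal((1000, 32))    # toy activations, layer L+1

gmm1 = GaussianMixture(n_components=4, random_state=0).fit(acts_layer1)
gmm2 = GaussianMixture(n_components=4, random_state=0).fit(acts_layer2)
z1, z2 = gmm1.predict(acts_layer1), gmm2.predict(acts_layer2)

# Transition matrix T[i, j] ~ P(cluster j in layer L+1 | cluster i in layer L).
T = np.zeros((4, 4))
for i, j in zip(z1, z2):
    T[i, j] += 1
T /= np.maximum(T.sum(axis=1, keepdims=True), 1)
print(np.round(T, 2))
```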
Authors: Amir Bar, Roei Herzig, Xiaolong Wang, Gal Chechik, Trevor Darrell, Amir Globerson
Abstract:
Videos of actions are complex signals, containing rich compositional structure. Current video generation models are limited in their ability to generate such videos. To address this challenge, we introduce a generative model (AG2Vid) that can be conditioned on an Action Graph, a structure that naturally represents the dynamics of actions and interactions between objects. Our AG2Vid model disentangles appearance and position features, allowing for more accurate generation. AG2Vid is evaluated on the CATER and Something-Something datasets and outperforms other baselines. Finally, we show how Action Graphs can be used for generating novel compositions of actions.
Authors: Amit Bracha, Oshri Halimi, Ron Kimmel
Abstract:
When matching non-rigid shapes, the regular or scale-invariant Laplace-Beltrami Operator (LBO) eigenfunctions could potentially serve as intrinsic descriptors which are invariant to isometric transformations. However, the computed eigenfunctions of two quasi-isometric surfaces could be substantially different. Such discrepancies include sign ambiguities and possible rotations and reflections within subspaces spanned by eigenfunctions that correspond to similar eigenvalues. Thus, without aligning the corresponding eigenspaces it is difficult to use the eigenfunctions as descriptors. Here, we propose to model the relative transformation between the eigenspaces of two quasi-isometric shapes using a band orthogonal matrix, as well as present a framework that aims to estimate this matrix. Estimating this transformation allows us to align the eigenfunctions of one shape with those of the other, that could then be used as intrinsic, consistent, and robust descriptors. To estimate the transformation we use an unsupervised spectral-net framework that uses descriptors given by the eigenfunctions of the scale-invariant version of the LBO. Then, using a spectral training mechanism, we find a band limited orthogonal matrix that aligns the two sets of eigenfunctions.
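For intuition only: the paper estimates the aligning orthogonal matrix without correspondences, via an unsupervised spectral net; as a much simpler stand-in, the sketch below shows that if point correspondences between two shapes were available, the orthogonal matrix aligning their truncated eigenbases would have a closed-form Procrustes solution. The eigenfunction matrices here are synthetic assumptions.

```python
# Closed-form (Procrustes) alignment of two truncated eigenbases, assuming
# known correspondences -- a simplified stand-in for the learned alignment.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
k = 20                                   # number of eigenfunctions kept
phi_a = rng.standard_normal((500, k))    # eigenfunctions of shape A at 500 points
Q_true, _ = np.linalg.qr(rng.standard_normal((k, k)))
phi_b = phi_a @ Q_true                   # shape B's basis = rotated/reflected copy

R_est, _ = orthogonal_procrustes(phi_a, phi_b)   # min ||phi_a R - phi_b||_F
aligned = phi_a @ R_est                          # now consistent with phi_b
print(np.abs(aligned - phi_b).max())             # ~0 on this synthetic example
```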
Authors: Meytal Rapoport-Lavie and Dan Raviv
Abstract:
Modern perception systems in the field of autonomous driving rely on 3D data analysis. LiDAR sensors are frequently used to acquire such data due to their increased resilience to different lighting conditions. Although rotating LiDAR scanners produce ring-shaped patterns in space, most networks analyze their data using an orthogonal voxel sampling strategy.
This work presents a novel approach for analyzing 3D data produced by 360-degree depth scanners, utilizing a more suitable coordinate system, which is aligned with the scanning pattern. Furthermore, we introduce a novel notion of range-guided convolutions, which adapt the receptive field according to the distance from the ego vehicle and the object's scale.
Our network demonstrates powerful results on the nuScenes challenge, comparable to current state-of-the-art architectures. The backbone architecture introduced in this work can be easily integrated into other pipelines as well.
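The minimal sketch below shows only the coordinate choice motivated above, not the detection pipeline: converting Cartesian LiDAR points to a cylindrical (range, azimuth, height) representation that follows a rotating scanner's ring pattern more naturally than an axis-aligned voxel grid.

```python
# Cartesian -> cylindrical conversion for LiDAR points in the ego-vehicle frame.
import numpy as np

def cartesian_to_cylindrical(points):
    """points: (N, 3) array of (x, y, z). Returns (range, azimuth, z) per point."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    radial = np.sqrt(x ** 2 + y ** 2)         # horizontal range from the ego vehicle
    azimuth = np.arctan2(y, x)                # angle around the scanner axis
    return np.stack([radial, azimuth, z], axis=1)

pts = np.array([[10.0, 0.0, -1.5], [0.0, 25.0, 0.2], [-5.0, -5.0, 1.0]])
print(cartesian_to_cylindrical(pts))
```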
Authors: Adam Kaufman and Raanan Fattal
Abstract:
Blind image deblurring remains a challenging problem for modern artificial neural networks. Unlike other image restoration problems, deblurring networks fall behind the performance of existing deblurring algorithms in the case of uniform and 3D blur models. This follows from the diverse and profound effect that the unknown blur-kernel has on the deblurring operator.
We propose a new architecture which breaks the deblurring network into an analysis network which estimates the blur, and a synthesis network that uses this kernel to deblur the image. Unlike existing deblurring networks, this design allows us to explicitly incorporate the blur-kernel in the network’s training.
In addition, we introduce new cross-correlation layers that allow better blur estimation, as well as unique components that allow the estimated blur to control the operation of the synthesis network. Evaluating the new approach over established benchmark datasets shows its ability to achieve state-of-the-art deblurring accuracy on various tests, as well as offer a major speedup in runtime.
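The cross-correlation layers above operate on learned feature maps inside the analysis network; the sketch below only demonstrates the underlying operation they build on, 2D circular cross-correlation computed with the FFT, on toy images where the correlation peak recovers a displacement.

```python
# 2D circular cross-correlation via the FFT (the basic operation; the paper's
# layers apply it to learned feature maps, which are not reproduced here).
import numpy as np

def cross_correlate(a, b):
    """Circular cross-correlation of two equal-sized (H, W) arrays."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

rng = np.random.default_rng(0)
sharp = rng.standard_normal((64, 64))
shifted = np.roll(sharp, shift=(3, 7), axis=(0, 1))     # displaced copy

corr = cross_correlate(shifted, sharp)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(corr), corr.shape))
print(peak)   # (3, 7): the correlation peak recovers the displacement
```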
Authors: Shady Abu Hussein, Tom Tirer, Raja Giryes
Abstract:
The single image super-resolution task is one of the most examined inverse problems in the past decade. In the recent years, Deep Neural Networks (DNNs) have shown superior performance over alternative methods when the acquisition process uses a fixed known downscaling kernel---typically a bicubic kernel. However, several recent works have shown that in practical scenarios, where the test data mismatch the training data (e.g. when the downscaling kernel is not the bicubic kernel or is not available at training), the leading DNN methods suffer from a huge performance drop. Inspired by the literature on generalized sampling, in this work we propose a method for improving the performance of DNNs that have been trained with a fixed kernel on observations acquired by other kernels. For a known kernel, we design a closed-form correction filter that modifies the low-resolution image to match one which is obtained by another kernel (e.g. bicubic), and thus improves the results of existing pre-trained DNNs. For an unknown kernel, we extend this idea and propose an algorithm for blind estimation of the required correction filter. We show that our approach outperforms other super-resolution methods, which are designed for general downscaling kernels. CVPR 2020 paper award nominee.
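To illustrate the idea only, and not the paper's exact closed-form filter, the sketch below applies a Wiener-style frequency-domain correction that re-shapes a low-resolution image obtained with one kernel so it behaves as if it were obtained with the training kernel (e.g. bicubic); the Gaussian kernels, regularization constant, and image are toy assumptions.

```python
# Wiener-style frequency-domain correction: H = B * conj(K) / (|K|^2 + eps).
# A pre-trained SR network (trained for kernel b) would then be applied as usual.
import numpy as np

def correction_filter(lr_image, k, b, eps=1e-3):
    """lr_image: (H, W); k, b: blur kernels zero-padded and centered to (H, W)."""
    K = np.fft.fft2(np.fft.ifftshift(k))
    B = np.fft.fft2(np.fft.ifftshift(b))
    H = B * np.conj(K) / (np.abs(K) ** 2 + eps)
    return np.real(np.fft.ifft2(np.fft.fft2(lr_image) * H))

def padded_gaussian(shape, sigma):
    h, w = shape
    y, x = np.mgrid[-h // 2:h // 2, -w // 2:w // 2]
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return g / g.sum()

rng = np.random.default_rng(0)
lr = rng.random((64, 64))                         # toy low-resolution image
corrected = correction_filter(lr, padded_gaussian((64, 64), 2.0),
                              padded_gaussian((64, 64), 1.0))
print(corrected.shape)
```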
Authors: Yuval Bahat, Tomer Michaeli
Abstract:
Most image restoration tasks are inherently ambiguous, as a degraded image can typically correspond to infinitely many different natural images. However, despite significant progress, existing image restoration methods do not allow exploring the abundance of natural images that might have given rise to the observed degraded image, and usually produce only a single output. This is a major limitation, as the different explanations to the degraded input can often dramatically vary in textures and fine details and may thus encode completely different semantic information. In this work, we introduce an “explorable image restoration” approach, which allows a user to manipulate the restored output so as to explore the abundance of plausible explanations to the degraded input. We specifically focus on the tasks of super-resolution and image decompression, for which we develop graphical user interfaces with neural network backends. At the heart of our framework lie mechanisms that analytically guarantee the consistency of all outputs with the given input, thus guaranteeing reliable exploration. We illustrate our approach in a variety of use cases, ranging from medical imaging and forensics, to graphics.
Authors: Liad Pollak-Zuckerman, Eyal Naor, George Pisha, Shai Bagon, Michal Irani
Abstract:
When a very fast dynamic event is recorded with a low-framerate camera, the resulting video suffers from severe motion blur (due to exposure time) and motion aliasing (due to low sampling rate in time). True Temporal Super-Resolution (TSR) is more than just Temporal-Interpolation (increasing framerate). It also recovers new high temporal frequencies beyond the temporal Nyquist limit of the input video, thus resolving both motion-blur and motion-aliasing. In this paper we propose a "Deep Internal Learning" approach for true TSR. We train a video-specific CNN on examples extracted directly from the low-framerate input video. Our method exploits the strong recurrence of small space-time patches inside a single video sequence, both within and across different spatio-temporal scales of the video. We further observe (for the first time) that small space-time patches recur also across dimensions of the video sequence - i.e., by swapping the spatial and temporal dimensions. In particular, the higher spatial resolution of video frames provides strong examples as to how to increase the temporal resolution of that video. Such internal video-specific examples give rise to strong self-supervision, requiring no data but the input video itself. This results in Zero-Shot Temporal-SR of complex videos, which removes both motion blur and motion aliasing, outperforming previous supervised methods trained on external video datasets.
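One concrete piece of the "across dimensions" observation can be written in a couple of lines: swapping a video's temporal axis with a spatial axis turns x-t (or y-t) slices into frame-like images, so the high spatial resolution of the input can supply examples for increasing its temporal resolution. The video tensor below is a toy placeholder.

```python
# Swap temporal and spatial axes of a (time, height, width) video tensor.
import numpy as np

video = np.zeros((16, 240, 320))          # (time, height, width) toy video
xt_view = np.transpose(video, (2, 1, 0))  # (width, height, time):
                                          # each "frame" is now a y-t slice
print(video.shape, "->", xt_view.shape)   # (16, 240, 320) -> (320, 240, 16)
```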
Authors: Tal Reiss, Niv Cohen, Liron Bergman, Yedid Hoshen
Abstract: Anomaly detection methods require high-quality features. Recently, the leading methods employed self-supervised feature learning. In this talk, we will compare such methods with a simple but effective way of obtaining strong features, i.e. adapting pre-trained features to anomaly detection on the target distribution. Unfortunately, simple adaptation methods often result in catastrophic collapse (feature deterioration) and reduce performance. Previous methods, e.g. DeepSVDD, combat collapse by modifying existing architectures, but this limits the adaptation performance gain. We propose PANDA, which effectively combats collapse in one-class classification. PANDA significantly outperforms the state-of-the-art in the one-class and outlier exposure settings (CIFAR10: 96.2% vs. 90.1% and 98.9% vs. 95.6%). We will also present results for anomaly segmentation showing further performance gains. A minimal sketch of the kNN scoring stage appears after the works listed below.
Based on the works:
Classification-Based Anomaly Detection for General Data, Liron Bergman, Yedid Hoshen, ICLR'20
PANDA: Adapting Pretrained Features for Anomaly Detection, Tal Reiss*, Niv Cohen*, Liron Bergman, Yedid Hoshen, arXiv
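The sketch below covers only the scoring stage on already-adapted features (the adaptation procedure itself, which is the contribution above, is not reproduced): a test sample is scored by its mean distance to its k nearest neighbors in the normal training set, with larger distances indicating anomalies. The feature arrays and k value are toy assumptions.

```python
# kNN-distance anomaly scoring on (assumed) adapted pre-trained features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train_features = rng.standard_normal((1000, 512))            # "normal" class only
test_features = np.vstack([rng.standard_normal((5, 512)),     # in-distribution
                           rng.standard_normal((5, 512)) + 4.0])  # anomalies

nn = NearestNeighbors(n_neighbors=2).fit(train_features)
dists, _ = nn.kneighbors(test_features)
scores = dists.mean(axis=1)          # higher = more anomalous
print(np.round(scores, 2))           # the last five scores are clearly higher
```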
Authors: Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, Shai Avidan
Abstract:
We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence. This makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation to the data, where every action is represented by its similarity to a group of base action-words. Then, we use a Dirichlet process based mixture, which is useful for handling proportional data such as our soft-assignment vectors, to determine if an action is normal or not.
We evaluate our method on two types of data sets. The first is a fine-grained anomaly detection data set (e.g. ShanghaiTech) where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection data set (e.g., a Kinetics-based data set) where few actions are considered normal, and every other action should be considered abnormal.
Extensive experiments on the benchmarks show that our method performs considerably better than other state-of-the-art methods.
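As a rough stand-in for the final normality model (the pose-graph embedding and clustering are assumed upstream, and scikit-learn's Dirichlet-process Gaussian mixture replaces the paper's mixture for proportional data), the sketch below fits a DP mixture on soft-assignment vectors of normal actions and scores new vectors by log-likelihood.

```python
# Dirichlet-process mixture over soft-assignment vectors: low log-likelihood
# under the model fit on normal actions flags an anomaly.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

def soft_assignments(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy data: "normal" actions concentrate on clusters 0-2, unusual ones on 5-7.
normal = soft_assignments(rng.standard_normal((500, 8))
                          + 4 * np.eye(8)[rng.integers(0, 3, 500)])
unusual = soft_assignments(rng.standard_normal((20, 8))
                           + 4 * np.eye(8)[rng.integers(5, 8, 20)])

dpmm = BayesianGaussianMixture(n_components=10,
                               weight_concentration_prior_type="dirichlet_process",
                               random_state=0).fit(normal)
print(dpmm.score_samples(normal[:5]))    # relatively high log-likelihoods
print(dpmm.score_samples(unusual[:5]))   # relatively low log-likelihoods
```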
Authors: Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, Ronen Basri
Abstract:
Recent theoretical work has shown that massively overparameterized neural networks are equivalent to kernel regressors that use Neural Tangent Kernels (NTK). Experiments show that these kernel methods perform similarly to real neural networks. Here we show that NTK for fully connected networks is closely related to the standard Laplace kernel. We show theoretically that for normalized data on the hypersphere both kernels have the same eigenfunctions and their eigenvalues decay polynomially at the same rate, implying that their Reproducing Kernel Hilbert Spaces (RKHS) include the same sets of functions. This means that both kernels give rise to classes of functions with the same smoothness properties. The two kernels differ for data off the hypersphere, but experiments indicate that when data is properly normalized these differences are not significant. Finally, we provide experiments on real data comparing NTK and the Laplace kernel, along with a larger class of γ-exponential kernels. We show that these perform almost identically. Our results suggest that much insight about neural networks can be obtained from analysis of the well-known Laplace kernel, which has a simple closed form.
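The kernels compared above have simple closed forms; the sketch below builds Gram matrices for the Laplace kernel k(x, y) = exp(-||x - y|| / sigma) and its gamma-exponential generalization exp(-(||x - y|| / sigma)^gamma) on data normalized to the unit hypersphere (gamma = 1 recovers the Laplace kernel). The data, sigma, and gamma values are illustrative.

```python
# Gram matrices for the Laplace and gamma-exponential kernels on normalized data.
import numpy as np

def gamma_exponential_gram(X, sigma=1.0, gamma=1.0):
    """X: (N, D) data matrix; returns the (N, N) kernel Gram matrix."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-(np.sqrt(d2) / sigma) ** gamma)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # normalize to the hypersphere

K_laplace = gamma_exponential_gram(X, sigma=1.0, gamma=1.0)
K_gamma = gamma_exponential_gram(X, sigma=1.0, gamma=1.5)
print(K_laplace[:2, :2], K_gamma[:2, :2], sep="\n")
```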
Authors: Yuval Atzmon, Felix Kreuk, Uri Shalit and Gal Chechik
Abstract:
People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even if they are not "essential" for the class. This leads to consistent misclassification of samples from a new distribution, like new combinations of known components.
Here we describe an approach for compositional generalization that builds on causal ideas. First, we describe compositional zero-shot learning from a causal perspective, and propose to view zero-shot inference as finding "which intervention caused the image?". Second, we present a causal-inspired embedding model that learns disentangled representations of elementary components of visual objects from correlated (confounded) training data. We evaluate this approach on two datasets for predicting new combinations of attribute-object pairs: A well-controlled synthesized images dataset and a real world dataset which consists of fine-grained types of shoes. We show improvements compared to strong baselines.
Authors: Yotam Nitzan, Amit Bermano, Yangyan Li, Daniel Cohen-Or
Abstract:
Learning disentangled representations of data is a fundamental problem in artificial intelligence. Specifically, disentangled latent representations allow generative models to control and compose the disentangled factors in the synthesis process. Current methods, however, require extensive supervision and training, or instead, noticeably compromise quality. In this paper, we present a method that learns how to represent data in a disentangled way, with minimal supervision, manifested solely using available pre-trained networks. Our key insight is to decouple the processes of disentanglement and synthesis, by employing a leading pre-trained unconditional image generator, such as StyleGAN. By learning to map into its latent space, we leverage both its state-of-the-art generative quality and its rich and expressive latent space, without the burden of training it. We demonstrate our approach on the complex and high dimensional domain of human heads. We evaluate our method qualitatively and quantitatively, and exhibit its success with de-identification operations and with temporal identity coherency in image sequences. Through this extensive experimentation, we show that our method successfully disentangles identity from other facial attributes, surpassing existing methods, even though they require more training and supervision.
Authors: Gavriel Habib, Nahum Kiryati, Miri Sklair-Levy, Anat Shalmon, Arnaldo Mayer, et al.
Abstract:
Mammography and ultrasound are extensively used by radiologists as complementary modalities to achieve better performance in breast cancer diagnosis. However, existing computer-aided diagnosis (CAD) systems for the breast are generally based on a single modality. In this work, we propose a deep-learning based method for classifying breast cancer lesions from their respective mammography and ultrasound images. We present various approaches and show a consistent improvement in performance when utilizing both modalities. The proposed approach is based on a GoogleNet architecture, fine-tuned for our data in two training steps. First, a distinct neural network is trained separately for each modality, generating high-level features. Then, the aggregated features originating from each modality are used to train a multimodal network to provide the final classification. In quantitative experiments, the proposed approach achieves an AUC of 0.94, outperforming state-of-the-art models trained over a single modality. Moreover, it performs similarly to an average radiologist, surpassing two out of four radiologists participating in a reader study. The promising results suggest that the proposed method may become a valuable decision support tool for breast radiologists.
Authors: Lior Gelberg, David Mendelovic, and Dan Raviv
Abstract:
User identification and continuous user identification are among the most challenging open problems we face, now more than ever given the work-from-home lifestyle brought on by the COVID-19 pandemic.
The ability to learn a style instead of a secret passphrase opens the door to the next level of person identification, as style is constructed from a person's set of motions and their relations.
Therefore, analyzing a person's style, rather than relying on their appearance (or some other easily fooled characteristic), can increase the level of security in numerous real-world applications, e.g., VPN, online education, and finance.
We present a novel architecture for person identification based on typing style, built on an adaptive non-local spatio-temporal graph convolutional network.
Since typing-style dynamics convey meaningful information that can be useful for person identification, we extract the joint positions and then learn the dynamics of their movements.
Our non-local approach increases our model's robustness to noisy input data, while analyzing joint locations instead of RGB data provides remarkable robustness to varying environmental conditions, e.g., lighting and noise.
We further present two new datasets for the typing-style-based person identification task, along with an extensive evaluation that demonstrates our model's superior discriminative and generalization abilities compared with state-of-the-art skeleton-based models.
Authors: Roy Shaul, Itamar David, Ohad Shitrit, Tammy Riklin-Raviv
Abstract:
A main challenge in magnetic resonance imaging (MRI) is speeding up scan time. Beyond improving patient experience and reducing operational costs, faster scans are essential for time-sensitive imaging, such as fetal, cardiac, or functional MRI, where temporal resolution is important and target movement is unavoidable, yet must be reduced. Current MRI acquisition methods speed up scan time at the expense of lower spatial resolution and costlier hardware. We introduce a practical, software-only framework, based on deep learning, for accelerating MRI acquisition, while maintaining anatomically meaningful imaging. This is accomplished by MRI subsampling followed by estimating the missing k-space samples via generative adversarial neural networks. A generator-discriminator interplay enables the introduction of an adversarial cost in addition to fidelity and image-quality losses used for optimizing the reconstruction. Promising reconstruction results are obtained from feasible sampling patterns of up to a fivefold acceleration of diverse brain MRIs, from a large publicly available dataset of healthy adult scans as well as multimodal acquisitions of multiple sclerosis patients and dynamic contrast-enhanced MRI (DCE-MRI) sequences of stroke and tumor patients. Clinical usability of the reconstructed MRI scans is assessed by performing either lesion or healthy tissue segmentation and comparing the results to those obtained by using the original, fully sampled images. Reconstruction quality and usability of the DCE-MRI sequences is demonstrated by calculating the pharmacokinetic (PK) parameters. The proposed MRI reconstruction approach is shown to outperform state-of-the-art methods for all datasets tested in terms of the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), as well as either the mean squared error (MSE) with respect to the PK parameters, calculated for the fully sampled DCE-MRI sequences, or the segmentation compatibility, measured in terms of Dice scores and Hausdorff distance. The code is available on GitHub.
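The tiny sketch below covers the acquisition model only, not the adversarial reconstruction network: it keeps a subset of k-space lines of a 2D slice (always retaining the low frequencies) and computes the zero-filled inverse-FFT reconstruction that a generator would be asked to improve; the phantom, sampling fractions, and mask pattern are toy assumptions.

```python
# k-space line subsampling and zero-filled reconstruction for a 2D slice.
import numpy as np

def subsample_kspace(image, keep_fraction=0.2, center_fraction=0.08, seed=0):
    """image: (H, W) real array. Returns (masked k-space, mask, zero-filled recon)."""
    h, w = image.shape
    rng = np.random.default_rng(seed)
    mask = rng.random(w) < keep_fraction            # random phase-encoding lines
    c = int(w * center_fraction / 2)
    mask[w // 2 - c: w // 2 + c] = True             # always keep low frequencies
    kspace = np.fft.fftshift(np.fft.fft2(image))
    kspace_masked = kspace * mask[None, :]
    zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace_masked)))
    return kspace_masked, mask, zero_filled

phantom = np.zeros((128, 128))                      # toy "anatomy"
phantom[40:90, 50:80] = 1.0
_, mask, recon = subsample_kspace(phantom)
print(mask.mean(), recon.shape)   # fraction of k-space lines kept, (128, 128)
```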