Elad Ben Baruch and Yosi Keller
Abstract. In this work, we propose a novel Convolutional Neural Network (CNN) architecture for matching pairs of image patches acquired by different sensors. Our approach utilizes two CNN sub-networks, where the first is a Siamese CNN and the second is a sub-network consisting of dual non-weight-sharing CNNs. This allows simultaneous joint and disjoint processing of the input pair of image patches. Training convergence and test accuracy are improved by introducing auxiliary losses and a corresponding hard-negative mining scheme. The proposed approach is experimentally shown to compare favorably with contemporary state-of-the-art schemes when applied to multiple datasets of multimodal images.
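A minimal PyTorch sketch of the two-branch idea described above: a weight-sharing (Siamese) stream and a pair of non-weight-sharing, sensor-specific streams process the same patch pair, and their descriptors are fused into a match/no-match logit. Layer sizes and the fusion head are illustrative assumptions, not the authors' exact architecture; the auxiliary losses mentioned in the abstract could be attached to each stream's descriptor.

```python
import torch
import torch.nn as nn

def conv_branch():
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())

class HybridMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = conv_branch()    # Siamese: same weights for both patches
        self.branch_a = conv_branch()  # sensor-specific branches:
        self.branch_b = conv_branch()  # no weight sharing
        self.head = nn.Sequential(nn.Linear(4 * 64, 64), nn.ReLU(),
                                  nn.Linear(64, 1))  # match / no-match logit

    def forward(self, pa, pb):
        feats = [self.shared(pa), self.shared(pb),
                 self.branch_a(pa), self.branch_b(pb)]
        return self.head(torch.cat(feats, dim=1))

# usage: logits = HybridMatcher()(torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64))
```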
Nicki Skafte Detlefsen, Oren Freifeld and Søren Hauberg
Abstract. Spatial Transformer layers allow neural networks, at least in principle, to be invariant to large spatial transformations in image data. The model has, however, seen limited uptake as most practical implementations support only transformations that are too restricted, e.g. affine or homographic maps, and/or destructive maps, such as thin plate splines. We investigate the use of flexible diffeomorphic image transformations within such networks and demonstrate that significant performance gains can be attained over currently used models. The learned transformations are found to be both simple and intuitive, thereby providing insights into individual problem domains. With the proposed framework, a standard convolutional neural network matches state-of-the-art results on face verification with only two extra lines of simple TensorFlow code.
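For context, here is a minimal PyTorch sketch of a standard spatial transformer layer with an affine map, i.e. the restricted transformation family the abstract argues against. The paper replaces this family with flexible diffeomorphisms (CPAB flows); in that case only the grid-generation step would change. The localization network below is a toy assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny localization net regressing 6 affine parameters per image.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(8 * 16, 6))
        # Initialize to the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

# usage: y = AffineSTLayer()(torch.randn(4, 1, 28, 28))
```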
Adam Geva (Technion), Yoav Y. Schechner (Technion), Yonatan Chernyak (Technion), Rajiv Gupta (Massachusetts General Hospital, Harvard Medical School).
Abstract. In current X-ray CT scanners, tomographic reconstruction relies only on directly transmitted photons. The models used for reconstruction have regarded photons scattered by the body as noise or disturbance to be disposed of, either by acquisition hardware (an anti-scatter grid) or by the reconstruction software. This increases the radiation dose delivered to the patient. Treating these scattered photons as a source of information, we solve an inverse problem based on a 3D radiative transfer model that includes both elastic (Rayleigh) and inelastic (Compton) scattering. We further present ways to make the solution numerically efficient. The resulting tomographic reconstruction is more accurate than traditional CT, while enabling significant dose reduction and chemical decomposition. Demonstrations include both simulations based on a standard medical phantom and a real scattering tomography experiment.
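A toy 1D NumPy illustration (my own drastic simplification, not the authors' 3D radiative transfer solver) of why scattered photons carry signal rather than noise: along a ray through an attenuation map, the detector sees the direct Beer-Lambert transmission plus a single-scatter term that also depends on the medium everywhere along the ray.

```python
import numpy as np

def detector_signal(mu_total, mu_scatter, ds=1.0, i0=1.0):
    # Optical depth from the source to each sample along the ray.
    tau = np.cumsum(mu_total) * ds
    direct = i0 * np.exp(-tau[-1])          # directly transmitted photons
    # Single scattering: photons attenuate to point s, scatter with
    # probability ~ mu_scatter(s), then attenuate over the remaining path
    # (isotropic phase function and same-ray geometry assumed for brevity).
    atten_to_s = np.exp(-tau)
    atten_s_to_det = np.exp(-(tau[-1] - tau))
    scattered = i0 * np.sum(mu_scatter * atten_to_s * atten_s_to_det) * ds
    return direct, scattered

mu = np.full(100, 0.01)                     # homogeneous slab, 100 samples
direct, scat = detector_signal(mu, 0.4 * mu)
print(direct, scat)  # both terms constrain mu; discarding `scat` wastes dose
```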
Assaf Shocher, Shai Bagon, Phillip Isola, Michal Irani
Abstract. Good Visual Retargeting changes the global size and aspect ratio of a natural image, while preserving the size and aspect ratio of all its local elements. In this paper we propose a Deep-Learning approach for image retargeting, based on an “Internal GAN” (InGAN). InGAN is an image-specific GAN, which captures the internal statistics of a single natural image. It trains on a single input image and learns the distribution of its patches. It is then able to synthesize natural-looking target images composed from the input image's patch distribution. InGAN is totally unsupervised, and requires no additional data other than the input image itself.
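A condensed PyTorch sketch of the single-image training idea: a generator maps the input image to a new size, and a fully convolutional (patch) discriminator compares generated patches against patches of the one real image. The architectures and the missing reconstruction/cycle terms are placeholders, far smaller and simpler than InGAN's multiscale design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(16, 1, 4, stride=2, padding=1))  # per-patch logits

img = torch.rand(1, 3, 128, 128)            # the single training image
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = F.binary_cross_entropy_with_logits

for step in range(200):
    # Generate a retargeted image (here: a 2:1 aspect ratio).
    fake = G(F.interpolate(img, size=(128, 256), mode='bilinear',
                           align_corners=False))
    # Discriminator: real patches vs. generated patches.
    d_real, d_fake = D(img), D(fake.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator on every patch.
    g_out = D(fake)
    g_loss = bce(g_out, torch.ones_like(g_out))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```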
Oshri Halimi, Ron Kimmel
Abstract. A classical approach for surface classification is to find a compact algebraic representation for each surface that would be similar for objects within the same class and preserve dissimilarities between classes. We introduce Self Functional Maps as a novel surface representation that satisfies these properties, translating the geometric problem of surface classification into an algebraic form of classifying matrices. The proposed map transforms a given surface into a universal isometry invariant form defined by a unique matrix. The suggested representation is realized by applying the functional maps framework to map the surface into itself. The key idea is to use two different metric spaces of the same surface for which the functional map serves as a signature. Specifically, in this paper, we use the regular and the scale-invariant surface Laplacian operators to construct two families of eigenfunctions. The result is a matrix that encodes the interaction between the eigenfunctions derived from two different Riemannian manifolds of the same surface. Using this representation, geometric shape similarity is converted into algebraic distances between matrices.
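A small NumPy sketch of the signature computation, assuming the two Laplacian eigenbases (regular and scale-invariant) have already been computed on the same mesh along with lumped vertex areas. The self functional map of the identity is then the matrix expressing one basis in the other under the area-weighted inner product.

```python
import numpy as np

def self_functional_map(phi_reg, phi_si, vertex_areas):
    """phi_reg, phi_si: (n_vertices, k) eigenfunction matrices;
    vertex_areas: (n_vertices,) lumped mass (area) weights."""
    w = np.sqrt(vertex_areas)[:, None]
    # Area-weighted least-squares projection of one basis onto the other:
    # C minimizes || A^(1/2) (phi_si @ C - phi_reg) ||.
    C = np.linalg.lstsq(w * phi_si, w * phi_reg, rcond=None)[0]
    return C  # (k, k) matrix; compared across shapes via matrix distances

# e.g. similarity of two surfaces s1, s2 (each a (phi_reg, phi_si, areas) triple):
# d = np.linalg.norm(self_functional_map(*s1) - self_functional_map(*s2))
```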
Yacov Hel-Or, Alon Oring, Zohar Yahini
Abstract. Auto-encoders are a special type of deep artificial neural network that are trained to obtain a reduced representation for input data. This reduced representation is defined as the latent space. Due to its reduced dimensionality, the representation in the latent space removes data redundancies while retaining the “essence” of the data by means of data reconstruction. An appealing outcome of auto-encoder representation is that it reveals the underlying parameters controlling the data generation. Thus, it is a common practice to interpolate between input examples by applying a linear interpolation in the latent space. Unfortunately, this practice does not rely on any theoretical background and often introduces unrealistic artifacts. In this talk we will show the source of these unrealistic artifacts and propose a new learning paradigm that corrects this phenomenon by enforcing the latent space to conform to a convex representation space.
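A minimal PyTorch sketch of the common practice discussed above: encode two inputs, linearly interpolate their latent codes, and decode. The `encoder` and `decoder` are assumed given; whether the decoded midpoints look realistic depends on the latent space being convex, which is the property the proposed training paradigm enforces.

```python
import torch

def latent_interpolation(encoder, decoder, x0, x1, steps=8):
    with torch.no_grad():
        z0, z1 = encoder(x0), encoder(x1)
        alphas = torch.linspace(0, 1, steps)
        # Decode points on the straight line between the two codes. If the
        # data's image in latent space is non-convex, some of these points
        # fall off it and decode to unrealistic artifacts.
        return torch.stack([decoder((1 - a) * z0 + a * z1) for a in alphas])
```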
Yochai Blau and Tomer Michaeli
Abstract. Image restoration algorithms are typically evaluated by some distortion measure (e.g. PSNR, SSIM, IFC, VIF) or by human opinion scores that quantify perceived quality. In this work, we prove mathematically that distortion and perceptual quality are at odds with each other. Specifically, we study the optimal probability for correctly discriminating the outputs of an image restoration algorithm from real images. We show that as the mean distortion decreases, this probability must increase (indicating worse perceptual quality). Contrary to common belief, this result holds true for any distortion measure, and is not only a problem of the PSNR or SSIM criteria. However, as we show experimentally, for some measures it is less severe (e.g. the distance between VGG features). We also show that generative-adversarial-nets (GANs) provide a principled way to approach the perception-distortion bound. This constitutes theoretical support to their observed success in low-level vision tasks. Based on our analysis, we propose a new methodology for evaluating image restoration methods, and use it to perform an extensive comparison between recent super-resolution algorithms.
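A closed-form toy in NumPy (my own construction, in the spirit of the paper's analysis, not its proof): denoising x = y + n with y ~ N(0,1) and n ~ N(0,s²) using linear estimators ŷ = a·x. Lowering the MSE forces the output distribution N(0, a²(1+s²)) away from the prior N(0,1), making it easier to discriminate from real samples.

```python
import numpy as np

s2 = 1.0                                  # noise variance
a = np.linspace(0.3, 1.2, 10)             # family of linear estimators
mse = (a - 1) ** 2 + a ** 2 * s2          # distortion of yhat = a*x
var = a ** 2 * (1 + s2)                   # output variance
# KL( N(0,var) || N(0,1) ): a proxy for how discriminable the outputs
# are from real data (0 means perceptually perfect in this toy).
kl = 0.5 * (var - np.log(var) - 1)

for ai, m, k in zip(a, mse, kl):
    print(f"a={ai:.2f}  MSE={m:.3f}  KL-to-prior={k:.3f}")
# The MSE-optimal a = 1/(1+s2) = 0.5 has KL > 0, while the
# distribution-matching a = 1/sqrt(1+s2) ~ 0.71 pays a higher MSE.
```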
Harel Haim, Shay Elmalem, Alex Bronstein, Emanuel Marom, Raja Giryes
Abstract. We show how it is possible to design the optics of a camera along with its image signal processing (ISP) parts using deep learning. This approach is used to design a phase-coded mask that enables all-in-focus imaging and depth estimation with improved image quality and depth reconstruction compared to previous designs. We believe that the proposed paradigm may lead to new optical designs and can be used beyond the specific optical design used in this work.
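A schematic PyTorch sketch of the end-to-end idea: a differentiable optical layer, here simplified to a learnable blur kernel standing in for the phase-coded mask's point spread function, feeds a small reconstruction CNN, and both are trained jointly. The real design optimizes physical phase-mask parameters through a wave-optics model, not a free kernel; this only shows how gradients can reach the "optics".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpticsPlusISP(nn.Module):
    def __init__(self, k=9):
        super().__init__()
        self.psf = nn.Parameter(torch.rand(1, 1, k, k))   # "optics" parameters
        self.isp = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, sharp):
        # Normalize the kernel so the simulated optics conserve energy.
        psf = torch.softmax(self.psf.flatten(), 0).view_as(self.psf)
        blurred = F.conv2d(sharp, psf, padding=self.psf.shape[-1] // 2)
        return self.isp(blurred)

model = OpticsPlusISP()
sharp = torch.rand(4, 1, 64, 64)
loss = F.mse_loss(model(sharp), sharp)    # gradients flow into the "optics" too
loss.backward()
```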
Alon Shoshan, Firas Shama, Roey Mechrez and Lihi Zelnik-Manor
Abstract. One of the key ingredients for successful optimization of modern CNNs is identifying a suitable objective. To date, the objective is fixed a priori at training time, and any variation to it requires re-training a new network. In this paper we present a first attempt at alleviating the need for re-training. Rather than fixing the network at training time, we train a "Dynamic-Net" that can be modified at inference time. Our approach considers an "objective-space" as the space of all linear combinations of two objectives, and the Dynamic-Net can traverse this objective-space at test-time, without any further training. We show that this upgrades pre-trained networks by providing an out-of-learning extension, while maintaining the performance quality. The solution we propose is fast and allows a user to interactively modify the network, in real-time, in order to obtain the desired result. We show the benefits of such an approach via several different applications.
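A minimal PyTorch sketch of traversing an objective-space at test time. The details here are my simplification of the idea described above: two parallel "tuning" blocks are trained, each under one objective, and at inference their contributions are blended with a user-chosen alpha, with no retraining.

```python
import torch
import torch.nn as nn

class DynamicBlock(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.main = nn.Conv2d(ch, ch, 3, padding=1)
        self.tune_a = nn.Conv2d(ch, ch, 3, padding=1)  # trained with objective A
        self.tune_b = nn.Conv2d(ch, ch, 3, padding=1)  # trained with objective B

    def forward(self, x, alpha):
        h = self.main(x)
        # alpha interpolates between the two learned behaviors at test time.
        return h + (1 - alpha) * self.tune_a(h) + alpha * self.tune_b(h)

block = DynamicBlock()
x = torch.randn(1, 16, 32, 32)
for alpha in (0.0, 0.5, 1.0):          # the user slides between the objectives
    y = block(x, alpha)
```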
Shir Goldstein and Yael Moses
Abstract. Musical note tracking (NT), identifying the pitch of played notes and their temporal information, is typically computed from audio data. Although audio is the natural source of information for NT, audio-based methods have limitations, mostly for polyphonic music analysis. When a string instrument is played, each of its strings vibrates at a certain frequency, producing a sound wave. We propose a novel, physics-based method for polyphonic NT of string instruments. First, the string vibrations are recovered from silent video captured by a commercial camera mounted on the instrument. These vibrations are also used to detect the string locations in the video. The NT of each string is then computed from a set of 1D signals extracted from the video. Analyzing each string separately allows us to overcome the limitations of audio-based polyphonic NT. By directly considering the expected frequencies of the played notes, their aliases, and their harmonics, we can overcome some limitations posed by the relatively low sampling rate of the camera. For a given frame rate, we analyze the set of notes that cannot be detected due to noise as well as indistinguishable pairs of notes. Our method is tested on real data, and its output is sheet music that can allow musicians to play the visually captured music. Our results show that the visual-based NT method can play an important role in solving the NT problem.
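A small NumPy sketch of the aliasing reasoning in the abstract: at a camera frame rate fps, a string vibrating at frequency f appears at the folded (aliased) frequency computed below. Matching measured spectral peaks against the expected aliases of candidate notes makes notes above the Nyquist rate detectable, while notes whose aliases coincide remain indistinguishable. The 240 fps value and the guitar tuning are illustrative assumptions.

```python
import numpy as np

def alias(f, fps):
    """Apparent frequency of a tone f sampled at fps (folded into [0, fps/2])."""
    return abs(f - fps * np.round(f / fps))

fps = 240.0
notes = {"E2": 82.41, "A2": 110.0, "D3": 146.83, "G3": 196.0,
         "B3": 246.94, "E4": 329.63}                      # open guitar strings
for name, f in notes.items():
    print(f"{name}: {f:7.2f} Hz -> appears at {alias(f, fps):6.2f} Hz")
# Two candidate notes are indistinguishable at this frame rate when their
# aliased frequencies (and those of their harmonics) are too close.
```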
Yoni Kasten, Meirav Galun, Ronen Basri
Abstract. Incremental (online) structure from motion pipelines seek to recover the camera matrix associated with an image $I_n$ given $n-1$ images, $I_1,...,I_{n-1}$, whose camera matrices have already been recovered. In this paper, we introduce a novel solution to the six-point online algorithm to recover the exterior parameters associated with $I_n$. Our algorithm uses just six corresponding pairs of 2D points, each extracted from $I_n$ and from any of the preceding $n-1$ images, allowing the recovery of the full six degrees of freedom of the $n$-th camera, and unlike common methods, does not require tracking feature points in three or more images. Our novel solution is based on constructing a Dixon resultant, yielding a solution method that is both efficient and accurate compared to existing solutions. We further use Bernstein's theorem to prove a tight bound on the number of complex solutions. Our experiments demonstrate the utility of our approach.
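A numerical sketch of the problem setup (not the authors' Dixon-resultant solver): each 2D-2D match between $I_n$ and a preceding image with known pose yields one epipolar constraint on the unknown pose. Because the known translations enter the relative pose, the constraints are not scale-invariant in the unknown translation, so six matches pin down all six degrees of freedom; here they are solved with generic nonlinear least squares on synthetic data.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def skew(t):
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def residuals(p, matches):
    R_n, t_n = Rotation.from_rotvec(p[:3]).as_matrix(), p[3:]
    res = []
    for R_i, t_i, x_i, x_n in matches:   # x_*: normalized homogeneous points
        R_rel = R_n @ R_i.T              # relative pose from camera i to n
        E = skew(t_n - R_rel @ t_i) @ R_rel
        res.append(x_n @ E @ x_i)        # epipolar constraint, zero at truth
    return res

rng = np.random.default_rng(0)
true = rng.normal(size=6) * 0.3
R_true, t_true = Rotation.from_rotvec(true[:3]).as_matrix(), true[3:]
matches = []
for _ in range(6):                       # one match per (possibly distinct) image
    R_i = Rotation.from_rotvec(rng.normal(size=3) * 0.2).as_matrix()
    t_i = rng.normal(size=3)
    X = rng.normal(size=3) + np.array([0, 0, 6.0])  # point in front of cameras
    xi, xn = R_i @ X + t_i, R_true @ X + t_true
    matches.append((R_i, t_i, xi / xi[2], xn / xn[2]))

sol = least_squares(residuals, np.zeros(6), args=(matches,))
# Should coincide with `true` when the solver reaches the correct root;
# the paper bounds the number of such roots analytically.
print(np.round(sol.x, 6), np.round(true, 6))
```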
Yedid Hoshen and Lior Wolf
Abstract. Several methods have recently been proposed for the task of translating images between domains without prior knowledge in the form of correspondences. The existing methods apply adversarial learning to ensure that the distribution of the mapped source domain is indistinguishable from the target domain, an approach that suffers from known stability issues. In addition, most methods rely heavily on “cycle” relationships between the domains, which enforce a one-to-one mapping. In this work, we introduce an alternative method: Non-Adversarial Mapping (NAM), which separates the task of target domain generative modeling from the cross-domain mapping task. NAM relies on a pre-trained generative model of the target domain, and aligns each source image with an image synthesized from the target domain, while jointly optimizing the domain mapping function. It has several key advantages: higher quality and resolution image translations, simpler and more stable training and reusable target models. Extensive experiments are presented validating the advantages of our method.
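A minimal PyTorch sketch of the non-adversarial alignment idea: the target-domain generator G is pre-trained and frozen; per-source-image latent codes z and a small shared mapping network f are optimized jointly with a plain reconstruction loss, with no discriminator anywhere. It is assumed here that G maps a batch of codes to images matching the size of `sources`; the loss and architecture of f are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nam_align(G, sources, z_dim=64, steps=500):
    G.requires_grad_(False)                        # pre-trained, frozen
    z = nn.Parameter(torch.randn(len(sources), z_dim))
    f = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
    opt = torch.optim.Adam([z, *f.parameters()], lr=1e-2)
    for _ in range(steps):
        # Align the mapped synthesis f(G(z_i)) with each source image x_i.
        loss = F.l1_loss(f(G(z)), sources)
        opt.zero_grad(); loss.backward(); opt.step()
    return z, f           # G(z_i) is the target-domain translation of x_i
```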
Simon Korman and Roee Litman
Abstract. We present a method that can evaluate a RANSAC hypothesis in constant time, i.e. independent of the size of the data. A key observation here is that correct hypotheses are tightly clustered together in the latent parameter domain. In a manner similar to the generalized Hough transform we seek to find this cluster, only that we need as few as two votes for a successful detection. Rapidly locating such pairs of similar hypotheses is made possible by adapting the recent "Random Grids" range-search technique. We only perform the usual (costly) hypothesis verification stage upon the discovery of a close pair of hypotheses. We show that this event rarely happens for incorrect hypotheses, enabling a significant speedup of the RANSAC pipeline. The suggested approach is applied and tested on three robust estimation problems: camera localization, 3D rigid alignment and 2D-homography estimation. We perform rigorous testing on both synthetic and real datasets, demonstrating an improvement in efficiency without a compromise in accuracy. Furthermore, we achieve state-of-the-art 3D alignment results on the challenging "Redwood" loop-closure challenge.
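A compact Python sketch of the pair-detection idea with a randomly shifted grid: each hypothesis, viewed as a point in parameter space, is hashed to its grid cell, and only when two hypotheses collide in a cell is the usual costly verification run. The cell size and the single-grid simplification are my assumptions; the "Random Grids" technique uses several random grids and shifts.

```python
import numpy as np

def ransac_with_grid(gen_hypothesis, verify, n_iters, cell=0.05, seed=0):
    rng = np.random.default_rng(seed)
    shift = rng.uniform(0, cell)                 # random grid offset
    buckets = {}
    for _ in range(n_iters):
        h = gen_hypothesis()                     # parameter vector, O(1) cost
        key = tuple(np.floor((h + shift) / cell).astype(int))
        if key in buckets:                       # two close hypotheses found:
            return verify(h)                     # run verification only now
        buckets[key] = h
    return None
```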
Rotal Kat, Roy Jevnisek, Shai Avidan
Abstract. We propose a new error measure for matching pixels that is based on co-occurrence statistics. The measure relies on a co-occurrence matrix that counts the number of times pairs of pixel values co-occur within a window. The error incurred by matching a pair of pixels is inversely proportional to the probability that their values co-occur together, rather than to their color difference. This measure also works with features other than color, e.g. deep features. We show that this improves the state-of-the-art performance of template matching on standard benchmarks. We then propose an embedding scheme that maps the input image to an embedded image such that the Euclidean distance between pixel values in the embedded space resembles the co-occurrence statistics in the original space. This lets us run existing vision algorithms on the embedded images and enjoy the power of co-occurrence statistics for free. We demonstrate this on two algorithms, the Lucas-Kanade image registration and the Kernelized Correlation Filter (KCF) tracker. Experiments show that performance of each algorithm improves by about 10%.
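A NumPy sketch of the measure for grayscale images: count how often two intensity values appear within a (2r+1)-window of each other, then score a pixel match by how improbable the value pair is. Details such as the normalization and the paper's Gaussian spatial weighting are simplified here.

```python
import numpy as np

def cooccurrence_matrix(img, r=3, levels=256):
    C = np.zeros((levels, levels))
    h, w = img.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Overlapping crops give all pixel pairs at offset (dy, dx).
            a = img[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            b = img[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
            np.add.at(C, (a.ravel(), b.ravel()), 1)
    return C / C.sum()

def match_cost(C, p, q):
    # Low co-occurrence probability -> high cost, regardless of |p - q|.
    return -np.log(C[p, q] + 1e-12)

img = (np.random.rand(64, 64) * 256).astype(int).clip(0, 255)
C = cooccurrence_matrix(img)
print(match_cost(C, 10, 12), match_cost(C, 10, 200))
```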
Yizhak (Itzik) Ben-Shabat, Michael Lindenbaum, Anath Fischer
Abstract. Modern robotic systems are often equipped with a direct 3D data acquisition device, e.g. LiDAR, which provides a rich 3D point cloud representation of the surroundings. This representation is commonly used for obstacle avoidance and mapping. Here, we propose a new approach for using point clouds for another critical robotic capability, semantic understanding of the environment (i.e. object classification). Convolutional neural networks (CNNs), which perform extremely well for object classification in 2D images, do not extend easily to the analysis of 3D point clouds due to their irregular format and varying number of points. The common solution of transforming the point cloud data into a 3D voxel grid needs to address severe accuracy vs. memory size tradeoffs. In this paper we propose a novel, intuitively interpretable, 3D point cloud representation called 3D Modified Fisher Vectors (3DmFV). Our representation is hybrid as it combines a coarse discrete grid structure with continuous generalized Fisher vectors. Using the grid enables us to design a new CNN architecture for real-time point cloud classification. In a series of performance analysis experiments, we demonstrate results that are competitive with or better than the state of the art on challenging benchmark datasets, while maintaining robustness to various data corruptions.
Video (4 minute summary of the paper): https://www.youtube.com/watch?v=jLNxXNChwdk
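A simplified NumPy sketch of the representation: place Gaussians on a uniform 3D grid, soft-assign every point, and aggregate per-Gaussian statistics into a fixed-size vector that a grid-structured CNN can consume. 3DmFV additionally uses min/max pooling and full Fisher-vector derivative statistics; this keeps only mean statistics to show the structure.

```python
import numpy as np
from itertools import product

def grid_fisher_features(points, m=4, sigma=None):
    """points: (n, 3) array, assumed scaled to [-1, 1]^3."""
    sigma = sigma or 1.0 / m
    centers = np.array(list(product(np.linspace(-1, 1, m), repeat=3)))  # (m^3, 3)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)      # (n, m^3)
    resp = np.exp(-0.5 * d2 / sigma ** 2)
    resp /= resp.sum(axis=1, keepdims=True)                  # soft assignments
    weight = resp.mean(axis=0)                               # 0th-order stats
    mean_dev = (resp[:, :, None] *
                (points[:, None, :] - centers)).mean(axis=0) # 1st-order stats
    return np.concatenate([weight, mean_dev.ravel()])        # (m^3 * 4,)

fv = grid_fisher_features(np.random.uniform(-1, 1, (1024, 3)))
print(fv.shape)   # fixed length regardless of the number of input points
```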
Assaf Arbelle and Tammy Riklin Raviv
Abstract. The analysis of cluttered objects in a video sequence is a challenging task, particularly in the presence of complex spatial structures and complicated temporal changes. We present a Deep Neural Network framework which addresses two aspects of object segmentation within video sequences, namely, limited annotated training data and the inherent dependencies between video frames. In order to compensate for the limited training data and avoid overfitting, we introduce an adversarial loss, inspired by Goodfellow et. al, and propose a unique discriminator architecture, termed the Rib-Cage network. The Rib-Cage network is designed such that multi-level features of both the image and segmentation maps are compared at multiple scales allowing for the extraction of complex joint representations. Furthermore, we propose the integration of the U-Net architecture (Ronneberger et. al) with Convolutional Long Short Term Memory (C-LSTM). The segmentation network’s unique architecture enables it to capture multi-scale, compact, spatio-temporal encoding of the objects in the C-LSTMs memory units. The proposed network exploits temporal cues which facilitate the individual segmentation of touching or partially occluded objects. The method was applied to live cell microscopy data and tested on the common cell segmentation benchmark, the Cell Tracking Challenge (www.celltrackingchallenge.net), and ranked 1st and 2nd place on two challenging datasets.