BIU Machine Learning and Data Science

Learning Club

Seminar in machine learning, data science, & applications, @ Bar-Ilan University

Announcements are made in the following group: (link).

Supported by the BIU Data Science Institute.

We meet on Thursdays. Talks begin at 10am.

Google Calendar

Stay tuned and subscribe to our calendar by pressing the plus button at the bottom right corner :)

Upcoming Talks

Karen Livescu from TTIC

Location: Gonda building (901), room 102.

Time: Sunday Dec 22nd, 12:00 -- 13:00.

Title: Embeddings for spoken words

Abstract: Word embeddings have become a ubiquitous tool in natural language processing. These embeddings represent the meanings of written words. On the other hand, for spoken language it may be more important to represent how a written word *sounds* rather than (or in addition to) what it means. For some applications it can also be helpful to represent variable-length acoustic segments corresponding to words, or other linguistic units, as fixed-dimensional vectors. This talk will present work on both acoustic word embeddings and "acoustically grounded" written word embeddings, including their applications for improved speech recognition and search.

Bio: Karen Livescu is an Associate Professor at TTI-Chicago. She completed her PhD in electrical engineering and computer science at MIT. Her main research interests are in speech and language processing and machine learning. Her recent work includes multi-view representation learning, acoustic word embeddings, visually grounded speech modeling, and automatic sign language recognition. Her recent professional activities include serving as a member of the IEEE Spoken Language Technical Committee, an associate editor for IEEE Transactions on Audio, Speech, and Language Processing, a technical co-chair of ASRU 2015/2017/2019, and a program co-chair of ICLR 2019.

Previous Talks

Only presentations from this year are shown.

Nadav Cohen from TAU

Location: Gonda building (901), room 102.

Time: Sunday Nov 24th, 12:00 -- 13:00.

Title: Analyzing Optimization and Generalization in Deep Learning via Trajectories of Gradient Descent

Abstract: Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details of the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the specific trajectories taken by the optimizer. I will then demonstrate a manifestation of this approach, analyzing the trajectories of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to a global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank, and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix recovery.

Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu and Yuping Luo.

Eran Malach from HUJI (PhD student).

Location: Gonda building (901), room 102.

Time: Sunday Nov 17th, 12:00 -- 13:00.

Title: Heuristic learning on the border between success and failure

Abstract: Many popular hypothesis classes, such as neural-networks or decision trees, are computationally hard to learn. In practice, however, heuristic algorithms are used to learn these classes with remarkable success. To better understand this gap, we explore probabilistic models where a small change in the distribution determines whether the optimization process succeeds or fails. We use these models to suggest specific distributional properties that differentiate between problems that are hard or easy to learn using common heuristic algorithms. We show theoretically and empirically that such properties play a key role in learning our "borderline" models, and suggest that they might be relevant for the broader effort of understanding algorithms used in practice.

Boris Ginsburg from NVIDIA.

Location: Gonda building (901), room 102.

Time: Tuesday Nov 12th, 10:00 AM -- 11:00 AM.

End-2-end neural models for automatic speech recognition


I will present two models: 1) QuartzNet, a deep convolutional acoustic model which achieves state-of-the-art accuracy on LibriSpeech and Wall Street Journal while having 3x-5x fewer parameters than competing models; and 2) an ASR post-processing model which "translates" ASR output into grammatically and semantically correct text. The Transformer-based encoder-decoder model demonstrates significant improvement in word error rate over the baseline acoustic model with greedy decoding. It outperforms baseline 6-gram language model re-scoring and approaches the performance of re-scoring with a Transformer-XL neural language model.


Boris Ginsburg is a Principal Engineer at NVIDIA, working on deep learning applications for speech and language processing. He joined NVIDIA in 2015. Before that he worked at Intel on hardware for deep learning, CPUs, and wireless networking. Boris has a Ph.D. in Applied Math from the Technion.

Ran Gilad-Bachrach from Tel-Aviv University.

Location: Gonda building (901), room 101.

Time: Sunday 10/11 12:00.

Hacking Classifiers


In this talk we will explore ways to stretch classifiers and use them in ways they were not intended to be used. In the first part of the talk we will break the training process of classifiers into its components: (i) selecting a prior, (ii) creating a posterior, and (iii) selecting a candidate classifier. We will focus on the third step and make a case for the use of deep classifiers*. We will show generalization bounds and other properties of such classifiers as well as algorithms to find them. In the second part of the talk we will use classifiers to solve unsupervised learning problems. The problem we are interested in is finding sub-types of diseases from unlabeled data. We will see how this problem can be solved by training a tree model where each node in the tree is a classifier and each leaf is a sub-type of the disease (also known as a cluster).

About the speaker:

Ran Gilad-Bachrach’s research focuses on machine learning and its applications to health and well-being. Prof. Gilad-Bachrach recently joined the faculty of the Bio-Medical Engineering department at Tel-Aviv University. Prior to that he was a principal researcher at Microsoft Research and led a machine learning research team at Intel Research. He did his Ph.D. studies at the Hebrew University of Jerusalem under the supervision of Prof. Naftali Tishby.

June. 6th 2019, Thu. 11:00 , Shrikanth (Shri) Narayanan (webpage).

University of Southern California, Los Angeles, CA (Viterbi School of Engineering, Signal Analysis and Interpretation Laboratory)

Location: Building 902, Room 301.

Behavioral Machine Intelligence from Multimodal Data


The global proliferation of smartphones and IoT deployments offers tremendous opportunities for continuous acquisition, analysis and sharing of diverse, information-rich yet unobtrusive time series data that provide a multimodal, spatiotemporal characterization of an individual’s behavior and state, and of the environment within which they operate. This has in turn enabled hitherto unimagined possibilities for understanding and supporting various aspects of human functioning in realms ranging from health and well-being to job performance.

Behavioral signals offer a window into decoding not just what one is doing but how one is thinking and feeling. At the simplest level, this could entail determining who is talking to whom about what and how, using automated audio and video analysis of verbal and nonverbal behavior. Computational behavioral modeling can also target more complex, higher level constructs, like human emotions. Behavioral signals combined with physiological signals of individuals (heart rate, respiration, skin conductance) as well as of their environment, offer further possibilities for understanding dynamic cognitive, affective and physical human states in context. Machine learning could also help detect, analyze and model deviation from what is deemed typical.

This talk will introduce this domain of human-centered signal processing and machine learning using active interdisciplinary collaborative research in the broad area of Behavioral Machine Intelligence and Informatics. Using research case studies that leverage multimodal sensing, signal processing, machine learning and bio-behavioral sciences, the talk will explore the complex interplay between individual difference variables (e.g., personality), mental states (e.g., emotions) and well-being (e.g., fatigue) in performance of jobs with varying cognitive, affective, and social demands in complex work environments such as hospitals.

Biography of the Speaker:

Shrikanth (Shri) Narayanan is the Niki & C. L. Max Nikias Chair in Engineering at the University of Southern California, where he is Professor of Electrical & Computer Engineering, and jointly in Computer Science, Linguistics, Psychology, Neuroscience, Otolaryngology and Pediatrics, Director of the Ming Hsieh Institute and Research Director of the Information Sciences Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the National Academy of Inventors, the Acoustical Society of America, IEEE, ISCA, the American Association for the Advancement of Science (AAAS), the Association for Psychological Science, and the American Institute for Medical and Biological Engineering (AIMBE). He is a recipient of several honors including the 2015 Engineers Council’s Distinguished Educator Award, a Mellon award for mentoring excellence, the 2005 and 2009 Best Journal Paper awards from the IEEE Signal Processing Society and serving as its Distinguished Lecturer for 2010-11, a 2018 ISCA Best Journal Paper award, and serving as an ISCA Distinguished Lecturer for 2015-16 and the Willard R. Zemlin Memorial Lecturer for ASHA in 2017. He has published over 800 papers and has been granted seventeen U.S. patents. His research and inventions have led to technology commercialization including through startups he co-founded: Behavioral Signals Technologies, focused on the telecommunication services and AI-based conversational assistance industry, and Lyssn, focused on mental health care delivery, treatment and quality assurance.

June. 2nd 2019, Sun. 12:00 , Oren Freifeld (webpage).

Ben-Gurion University.

Location: Nano Building (206), Room B991.

Geometric transformations in deep learning and Bayesian nonparametric mixture models


This talk will focus on two of the main research directions at our Vision, Inference, and Learning group: 1) Geometric transformations in deep learning and 2) Bayesian nonparametric mixture models.

During the talk, which is based on both published and under-review works, I will touch upon applications in computer vision, machine learning, and time-series analysis.

About the speaker:

Oren Freifeld is a faculty member at Ben-Gurion University Computer Science Department. Previously, he was at MIT CSAIL (postdoc), Brown University Applied Math (PhD, ScM), Stanford Electrical Engineering (Visiting PhD Student), and Max Planck Institute for Intelligent Systems (Visiting PhD Student). His research focuses on practical and mathematically-principled tools for high-dimensional data analysis, particularly those that scale gracefully with the data's size, and that adapt model complexity to the data. He is mostly interested in Bayesian and/or geometric methods, and in problems such as unsupervised learning, motion analysis, segmentation, statistical image models, signal alignment, and deep learning.

May. 26th 2019, Sun. 12:00 , Students Session (Joseph Keshet & Gal Chechik).

Bar-Ilan University.

Location: Nano Building (206), Room B991.

May. 12th 2019, Sun. 12:00 , Lihi Zelnik-Manor and Asaf Noy (webpage).

Technion - Israel Institute of Technology and Alibaba.

Location: Nano Building (206), Room B991.

AutoML by Alibaba: Architecture Search, Anneal and Prune


Manual network architecture search and hyper-parameter tuning of neural-network-based algorithms can be a very interesting task, but also an arduous one. Automatic methods for Neural Architecture Search (NAS) target this, aiming to automate the process. In this talk, we will present novel methods developed at Alibaba DAMO Israel lab for automating NAS. With these methods, we are able to reach SotA results on various image classification datasets within 1 GPU day per dataset.

Apr. 29th 2019, Sun. 11:00 , Shai Ben David (webpage).

University of Waterloo.

Location: CS Building (216), Room 201 .

Near-optimal Sample Complexity Bounds for Robust Learning of Gaussian Mixtures via Compression Schemes


We prove that Θ(kd^2/ε^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error ε in total variation distance. This improves both the known upper bounds and lower bounds for this problem. For mixtures of axis-aligned Gaussians, we show that O(kd/ε^2) samples suffice, matching a known lower bound. Moreover, these results hold in the agnostic-learning/robust-estimation setting as well, where the target distribution is only approximately a mixture of Gaussians. The upper bound is shown using a novel technique for distribution learning based on a notion of compression. Any class of distributions that allows such a compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d admits a small-sized compression scheme.

(Joint work with Hassan Ashtiani, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian and Yaniv Plan)

Apr. 28th 2019, Sun. 12:00 , Raja Giryes (webpage).

Tel-Aviv University.

Location: Nano Building (206), Room B991.

Using information theory for deep learning


In this talk, we will use two tools in information theory to gain a better understanding of deep learning training.

First, we will describe the problem of analog channel coding and see how it may explain the dropout regularization.

Second, we will discuss the Lautum information and see how it may help in the problem of semi-supervised transfer learning.

The works described in this talk are done in collaboration with Dor Bank, Daniel Jakubovitz and Miguel Rodriguez.

Apr. 7th 2019, Sun. 12:00 , Aviv Tamar (webpage).

Technion - Israel Institute of Technology.

Location: Nano Building (206), Room B991.

Learning Representations for Planning


How can we build autonomous robots that operate in unstructured and dynamic environments such as homes or hospitals?

This problem has been investigated under several disciplines, including planning (motion planning, task planning, etc.) and reinforcement learning (RL). While both of these fields have witnessed tremendous progress, each has fundamental drawbacks when it comes to autonomous robots. In general, planning approaches require substantial manual engineering in specifying a model for the domain, while RL is data hungry and does not generalize beyond the tasks seen during training.

In this talk, we present several studies that aim to mitigate these shortcomings by combining ideas from both planning and learning. We start by introducing value iteration networks, a type of differentiable planner that can be used within model-free RL to obtain better generalization. Next, we consider a practical robotic assembly problem, and show that motion planning, based on readily available CAD data, can be combined with RL to quickly learn policies for assembling tight fitting objects. We conclude with our recent work on learning to imagine goal-directed visual plans. Motivated by humans’ remarkable capability to predict and plan complex manipulations of objects, we develop a data-driven method that learns to ‘imagine’ a plausible sequence of observations that transition a dynamical system from its current configuration to a desired goal state. Key to our method is Causal InfoGAN, a deep generative model that can learn features that are compatible with strong planning algorithms. We demonstrate our approach on learning to imagine and execute robotic rope manipulation.

Mar. 31st 2019, Sun. 12:00 , Sivan Sabato (webpage).

Ben-Gurion University.

Location: Nano Building (206), Room B991.

A Universally Consistent 1-Nearest-Neighbor Algorithm


We show a 1-Nearest-Neighbor algorithm that is universally strongly-Bayes-consistent in all metric spaces where such a learner exists.

This is the first learning algorithm known to enjoy this property.

Joint work with Steve Hanneke, Aryeh Kontorovich, and Roi Weiss.

Mar. 17th 2019, Sun. 12:00 , Tomer Galanti (webpage).

Tel-Aviv University (PhD Student).

Location: Nano Building (206), Room C563.

Generalization Bounds for Unsupervised Image to Image translations with WGANs


The recent empirical success of cross-domain mapping algorithms, between two domains that share common characteristics, is not well-supported by theoretical justifications. This lacuna is especially troubling, given the clear ambiguity in such mappings.

We work with the adversarial training method called the Wasserstein GAN and derive a novel generalization bound, which limits the risk between the learned mapping $h$ and the target mapping $y$, by a sum of two terms: (i) the risk between $h$ and the most distant alternative mapping that was learned by the same cross-domain mapping algorithm, and (ii) the minimal Wasserstein GAN divergence between the target domain and the domain obtained by applying a hypothesis $h^*$ on the samples of the source domain, where $h^*$ is a hypothesis selected by the same algorithm. The bound is directly related to Occam's razor and it encourages the selection of the minimal architecture that supports a small Wasserstein GAN divergence.

The bound leads to multiple algorithmic consequences, including a method for hyperparameter selection and for an early stopping in cross-domain mapping GANs. We also demonstrate a novel capability for unsupervised learning of estimating confidence in the mapping of every specific sample. Lastly, we show how non-minimal architectures can be effectively trained by an inverted knowledge distillation in which a minimal architecture is used to train a larger one, leading to higher quality outputs.

Bio: A PhD student at Tel Aviv University, under the supervision of Prof. Lior Wolf, with a focus on the theoretical aspects of unsupervised learning and deep learning.

Mar. 10th 2019, Sun. 12:00 , Yair Weiss (webpage).

The Hebrew University of Jerusalem.

Location: Gonda Building (901), Room 101.

Why do deep convolutional networks generalize so poorly to small image transformations?


Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, we see these failures to generalize more frequently in more modern networks. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets make it unlikely that CNNs will learn to be invariant to these transformations. Taken together our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.
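The link between subsampling and the sampling theorem can be illustrated with a toy one-dimensional signal. This is only a minimal sketch and not the speaker's experiments: the helper names (`shift`, `subsample`, `blur`) are hypothetical, the shift is circular for simplicity, and the two-tap average stands in for a generic low-pass filter.

```python
def shift(signal, k=1):
    """Circularly shift a list to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def subsample(signal, stride=2):
    """Keep every `stride`-th sample, as a strided layer would."""
    return signal[::stride]


def blur(signal):
    """Two-tap circular average: a crude low-pass filter (an assumption,
    chosen only because it removes the highest-frequency component)."""
    n = len(signal)
    return [(signal[i] + signal[(i + 1) % n]) / 2 for i in range(n)]


# An alternating signal sits at the highest frequency a stride-2 grid
# cannot represent, so it violates the sampling theorem after subsampling.
signal = [0, 1, 0, 1, 0, 1, 0, 1]

# Without filtering, a one-sample shift flips every retained value.
plain = subsample(signal)
plain_shifted = subsample(shift(signal, 1))

# With low-pass filtering first, the subsampled output is unchanged by the shift.
smooth = subsample(blur(signal))
smooth_shifted = subsample(blur(shift(signal, 1)))
```

Here `plain` and `plain_shifted` disagree at every position, while `smooth` and `smooth_shifted` coincide. Real networks and images are far more complex, so this only conveys the intuition behind the aliasing argument.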

Mar. 3rd 2019, Sun. 12:00 , Daniel Soudry (webpage).

Technion - Israel Institute of Technology.

Location: Nano Building (206), 5th floor, Cyber Center Meeting Room.

Theoretical and Empirical Investigation of Several Common Practices in Deep Learning


We examine several empirical and theoretical results on the training of deep networks. For example,

  • Why are common "over-fitting" indicators (e.g., very low training error, high validation loss) misleading?
  • Why does the early-stopping time sometimes never arrive?
  • Why can adaptive rate methods (e.g., Adam) degrade generalization?
  • Why do some commonly used loss functions exhibit better generalization than others?
  • Why use weight decay before batch-norm?
  • When can we use low numerical precision, and how low can we get?

and discuss the practical implications of these results.


Since October 2017, Daniel Soudry has been an assistant professor (Taub Fellow) in the Department of Electrical Engineering at the Technion, working in the areas of machine learning and theoretical neuroscience. Before that, he did his post-doc (as a Gruss Lipper fellow) working with Prof. Liam Paninski in the Department of Statistics, the Center for Theoretical Neuroscience, and the Grossman Center for Statistics of the Mind at Columbia University. He did his Ph.D. in the Department of Electrical Engineering at the Technion, Israel Institute of Technology, under the guidance of Prof. Ron Meir. He received his B.Sc. degree in Electrical Engineering and Physics from the Technion.

Feb. 24th 2019, Sun. 12:00 , Sagie Benaim (webpage).

Tel-Aviv University (PhD Student).

Location: Gonda Building (901), Big Auditorium Downstairs.

New Capabilities in Unsupervised Image to Image Translation


In Unsupervised Image to Image Translation, we are given an unmatched set of images from domain A and domain B, and our task is to generate, given an image from domain A, its analogous image in domain B.

In the first part of the talk, I'll describe a new capability which allows us to perform such translation when only a single image is present in domain A. Specifically, given a single image x from domain A and a set of images from domain B, our task is to generate the analog of x in B. We argue that this task could be a key AI capability that underlies the ability of cognitive agents to act in the world, and present empirical evidence that the existing unsupervised domain translation methods fail on this task.

In the second part of the talk, I'll describe a new capability which allows us to disentangle the "common" and "domain-specific" information of domains A and B. This allows us to generate, given a sample a in A and a sample b in B, an image in domain B which contains the "common" information of a and the "domain-specific" information of b. For example, ignoring occlusions, B can be "people with glasses" and A can be "people without". The "common" information is "faces", whereas the "domain-specific" information of B is "glasses". At test time, we add the glasses of a person in domain B to any person in domain A.

Lastly, time permitting, I'll describe the application of these techniques in the context of Singing Voice Separation, where the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music.

Jan. 6th 2019, Sun. 12:00 , Roy Bar-Haim (webpage).

Debating Technologies group, IBM Research AI - Haifa.

Location: Gonda Building (901), Room 101.

Stance Classification and Sentiment Analysis in IBM Project Debater


Project Debater is the first AI system that was shown to debate humans in a meaningful manner in a full live debate. Developing this system started in 2012, as the next AI Grand Challenge pursued by IBM Research, following the demonstration of Deep Blue in Chess in 1997, and Watson in Jeopardy! in 2011. The system was revealed in June 2018, in two full live debates against expert human debaters, and received massive media attention.

In this talk I will first give a high-level view of the project and its core technologies. I will then focus on one of its most challenging parts – understanding the stance of arguments. I will survey several of our works on stance classification and sentiment analysis of arguments, which resulted in several publications, language resources and datasets.

In the last part of the talk, I will present our recent work on learning sentiment composition, a fundamental sentiment analysis problem. Previous work relied on manual rules and manually-created lexical resources such as negator lists, or learned a composition function from sentiment-annotated phrases or sentences. We propose a new approach for learning sentiment composition from a large, unlabeled corpus, which only requires a word-level sentiment lexicon for supervision.


Roy Bar-Haim is a Research Staff Member in IBM Research – Haifa. Over the last six years, he has been leading a global team of research scientists working on core components in Project Debater. Roy also serves as the Haifa lab’s co-chair of the Natural Language Processing Professional Interests Community (NLP PIC). Before joining IBM, he led NLP and ML research teams in several startups. He has published in, and reviewed for, top NLP and AI conferences and journals. He serves on the elite standing reviewer team of TACL (Transactions of the Association for Computational Linguistics) and was an area co-chair at the COLING 2016 conference. Roy received his B.Sc and M.Sc degrees from the Technion, and his Ph.D from Bar-Ilan University, all in computer science.

Dec. 16th 2018, Sun. 12:00 , Aryeh Kontorovich (webpage).

Ben-Gurion University.

Location: Gonda Building (901), Room 101.

Vignettes on sample compression


Sample compression is a natural and elegant learning framework, which allows for storage and runtime savings as well as sharp generalization bounds. In this talk, I will survey a few recent collaborations that touch upon various aspects of sample compression. Central among these is the development of a new algorithm for learning in arbitrary metric spaces based on a margin-regularized 1-nearest neighbor, which we call OptiNet. The latter is strongly universally Bayes-consistent in all essentially-separable metric probability spaces. OptiNet is the first learning algorithm to enjoy this property; by comparison, k-NN and its variants are not Bayes-consistent, except under additional structural assumptions, such as an inner product, a norm, finite doubling dimension, or a Besicovitch-type property. I will then talk about sample compression in the context of regression, extensions to non-uniform margins, and, time permitting, generalization lower bounds.

Nov. 25th 2018, Sun. 12:00 , Tamir Hazan (webpage).

Technion - Israel Institute of Technology.

Location: Gonda Building (901), Room 101.

Direct Optimization through argmax for Discrete Variational Auto-Encoder


Reparameterization of variational auto-encoders is an effective method for reducing the variance of their gradient estimates. However, when the latent variables are discrete, a reparameterization is problematic due to discontinuities in the discrete space. In this work, we extend the direct loss minimization technique to discrete variational auto-encoders. We first reparameterize a discrete random variable using the arg max function of the Gumbel-Max perturbation model. We then use direct optimization to propagate gradients through the non-differentiable arg max using two perturbed arg max operations.
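The Gumbel-Max reparameterization mentioned above can be sketched in a few lines: adding independent Gumbel(0, 1) noise to each logit and taking the arg max yields a sample from the softmax distribution over the logits. This is only a minimal illustration of that sampling step, not the direct-optimization gradient estimator from the talk; the function names are hypothetical.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one index distributed as softmax(logits) via the Gumbel-Max trick."""
    best_index, best_value = 0, -math.inf
    for index, logit in enumerate(logits):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() may return exactly 0.0
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        if logit + gumbel_noise > best_value:
            best_index, best_value = index, logit + gumbel_noise
    return best_index


def softmax(logits):
    """Numerically stable softmax, used here only as a reference distribution."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_frequencies(logits, num_samples=20000, seed=0):
    """Estimate the sampling distribution by repeated Gumbel-Max draws."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_samples):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / num_samples for c in counts]
```

Comparing `empirical_frequencies(logits)` with `softmax(logits)` for a small logit vector shows the two agree up to sampling noise. The arg max itself remains non-differentiable, which is the difficulty the abstract addresses.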

Nov. 11th 2018, Sun. 12:00 , Ohad Shamir (webpage).

Weizmann Institute of Science.

Location: Gonda Building (901), Room 101.

Optimization Landscape of Neural Networks: Where Do the Local Minima Hide?


Training neural networks is a highly non-convex optimization problem, which is often successfully solved in practice, but the reasons for this are poorly understood. Much recent work has focused on showing that these non-convex problems do not suffer from poor local minima. However, this has only been provably shown under strong assumptions or in highly restrictive settings. In this talk, I’ll describe some recent results on this topic, both positive and negative. On the negative side, I’ll show how local minima can be ubiquitous even when optimizing simple, one-hidden-layer networks under favorable data distributions. On the flip side, I’ll discuss how looking at other architectures (such as residual units), or modifying the question, can lead to positive results under mild assumptions.

June 21st 2018, Thu 10:00 , Amir Globerson (webpage).

Tel Aviv University (Faculty).

Location: Gonda Building (901), Room 101.

Deep Learning: Optimization, Generalization and Architectures


Three key challenges in deep learning are: understanding why optimization works despite non-convexity, understanding why generalization is possible despite training very large models with limited data, and understanding architecture design. In this talk I will discuss our recent work on these questions.

June 14th 2018, Thu 10:00 , Eran Malach (webpage) (slides).

The Hebrew University of Jerusalem (PhD Student)

Location: Gonda Building (901), Room 101.

A Provably Correct Algorithm for Deep Learning that Actually Works


We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two layer network followed by a simple clustering algorithm. Our algorithm stems from a deep generative model that generates images level by level, where lower resolution images correspond to latent semantic classes. We analyze the convergence rate of our algorithm assuming that the data is indeed generated according to this model (as well as additional assumptions). While we do not pretend to claim that the assumptions are realistic for natural images, we do believe that they capture some true properties of real data. Furthermore, we show that our algorithm actually works in practice (on the CIFAR dataset), achieving results in the same ballpark as that of vanilla convolutional neural networks that are being trained by stochastic gradient descent.

June 13th 2018, Mon. 12:00 , Yinyin Liu.

Head of Data Science, Intel AIPG.

Location: Building 216, Room 201.


The Intel AI Lab, within the AI Products Group was formed last year with the goal of pursuing fundamental and applied AI research, and the long term vision of building brain-like capabilities. The lab is focused on developing and implementing state of the art algorithms in topics such as natural language processing, vision, audio, reinforcement learning, recommendation systems, and robotic learning. Key vertical areas include autonomous driving, federal and retail. We partner both internally with Intel Labs and other groups, and externally with universities and companies. The output includes open source software releases and publishing at top AI conferences. The work also helps our partner Intel teams build better hardware and software products, and marketing demos. In this talk I will give an overview of the AI Lab.

Yinyin’s BIO

Yinyin Liu is the head of data science for AIPG at Intel, where she works with a team of data scientists on applying deep learning and Intel Nervana technologies to business applications across different industry domains and driving the development and design of the Intel Nervana platform. She and the Intel Nervana team have developed open source deep learning frameworks, such as neon and Intel Nervana Graph, bringing state-of-the-art models on image recognition, image localization, and natural language processing into the frameworks. Yinyin has research experience in computer vision, neuromorphic computing, and robotics.

June 11th 2018, Mon. 11:00 , Zachary Chase Lipton (webpage).

Carnegie Mellon University (CMU).

Location: Gonda Building (901), Room 101.

Detecting and Correcting for Label Shift with Black Box Predictors


Faced with distribution shift between training and test set, we wish to detect and quantify the shift, and to correct our classifiers without test set labels. Motivated by medical diagnosis, where diseases (targets) cause symptoms (observations), we focus on label shift, where the label marginal p(y) changes but the conditional p(x|y) does not. We propose Black Box Shift Estimation (BBSE) to estimate the test distribution p(y). BBSE exploits arbitrary black box predictors to reduce dimensionality prior to shift correction. While better predictors give tighter estimates, BBSE works even when predictors are biased, inaccurate, or uncalibrated, so long as their confusion matrices are invertible. We prove BBSE's consistency, bound its error, and introduce a statistical test that uses BBSE to detect shift. We also leverage BBSE to correct classifiers. Experiments demonstrate accurate estimates and improved prediction, even on high-dimensional datasets of natural images.
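
The estimation step described in the abstract can be illustrated with a small NumPy sketch. This is not the speaker's code: the function names `bbse_weights` and `bbse_target_prior` and the toy counts are assumptions made for illustration. The sketch builds the joint confusion matrix of a black box predictor on labeled source data, measures the distribution of its predictions on unlabeled target data, and solves a linear system for the importance weights w(y) = q(y) / p(y), where p is the source (training) label distribution and q is the target (test) label distribution.

```python
import numpy as np

def bbse_weights(y_src, yhat_src, yhat_tgt, n_classes):
    """Estimate label-shift weights w[y] = q(y) / p(y) (illustrative sketch)."""
    # Joint confusion matrix on labeled source data: C[i, j] = p(yhat = i, y = j).
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (yhat_src, y_src), 1.0)
    C /= len(y_src)
    # Distribution of predictions on unlabeled target data: mu[i] = q(yhat = i).
    mu = np.bincount(yhat_tgt, minlength=n_classes) / len(yhat_tgt)
    # Under label shift, mu = C @ w, so w is recovered when C is invertible.
    return np.linalg.solve(C, mu)

def bbse_target_prior(y_src, w, n_classes):
    """Turn the weights into an estimate of the target label marginal q(y)."""
    p_src = np.bincount(y_src, minlength=n_classes) / len(y_src)
    return w * p_src

# Toy data (hypothetical): a predictor that is right 80% of the time per class.
# Source labels are balanced; the target set has 80% class 0 and 20% class 1.
y_src = np.repeat([0, 0, 1, 1], [40, 10, 10, 40])     # true source labels
yhat_src = np.repeat([0, 1, 0, 1], [40, 10, 10, 40])  # predictions on source
yhat_tgt = np.repeat([0, 1], [68, 32])                # predictions on target

w_hat = bbse_weights(y_src, yhat_src, yhat_tgt, n_classes=2)
q_hat = bbse_target_prior(y_src, w_hat, n_classes=2)
print(w_hat, q_hat)
```

In this toy setup the predictor is wrong 20% of the time, yet the target label marginal is still recovered, which mirrors the abstract's point that the predictor only needs an invertible confusion matrix.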


Zachary Chase Lipton is an assistant professor at Carnegie Mellon University. His research spans both core machine learning methods and their social impact, concentrating on machine learning for healthcare, data-efficient deep learning, temporal structure, and learning under domain adaptation. This work addresses diverse application areas, including diagnosis, dialogue systems, and product recommendation. He is the founding editor of the Approximately Correct blog and the lead author of Deep Learning – The Straight Dope, an open-source interactive book teaching deep learning through Jupyter notebooks. Find him on Twitter (@zacharylipton) or GitHub (@zackchase).

June 7th 2018, Thu 10:00 , Uri Shalit (webpage).

Technion – Israel Institute of Technology (Faculty).

Location: Gonda Building (901), Room 101.

Learning to Act from Observational Data: Machine Learning and Causal Inference in Healthcare


The proliferation of data collection in the health, commercial, and economic spheres brings with it opportunities for extracting new knowledge leading to concrete policy implications. An example that motivates my research is using electronic healthcare records to individualize medical practices.

The scientific challenge lies in the fact that standard prediction models such as supervised machine learning are often not enough for decision making from this so-called “observational data”: Supervised learning does not take into account causality, nor does it account for the feedback loops that arise when predictions are turned into actions. On the other hand, existing causal-inference methods are not adapted to dealing with the rich and complex data now available, and often focus on populations, as opposed to individual-level effects.

In my talk, I will discuss the challenges of applying machine learning in the clinical healthcare setting, and show how we apply recent ideas from machine learning and specifically deep-learning to individual-level causal-inference and action.

May 31st 2018, Thu 10:00 , Or Sharir (webpage) (slides).

The Hebrew University of Jerusalem (PhD Student)

Location: Gonda Building (901), Room 101.

On the Expressive Power of ConvNets and RNNs as a Function of their Architecture


The driving force behind convolutional and recurrent networks — two of the most successful deep learning architectures to date — is their expressive power. Despite its wide acceptance and vast empirical evidence, formal analyses supporting this belief are scarce. The primary notions for formally reasoning about expressiveness are efficiency and inductive bias. Efficiency refers to the ability of a network architecture to realize functions that require an alternative architecture to be much larger. Inductive bias refers to the prioritization of some functions over others given prior knowledge regarding a task at hand. Through an equivalence to hierarchical tensor decompositions, we study the expressive efficiency and inductive bias of various architectural features in convolutional networks (depth, width, pooling geometry, inter-connectivity, overlapping receptive fields etc.) as well as the long-term memory capacity of deep recurrent networks. Our results shed light on the demonstrated effectiveness of modern networks, and in addition, provide new tools for network design.

May 3rd 2018, Thu 10:00 , Jonathan Berant (webpage).

Tel Aviv University (Faculty).

Location: Gonda Building (901), Room 101.

Talking to your Virtual Assistant about anything


Conversational interfaces and virtual assistants are now part of our lives due to services such as Amazon Alexa, Google Voice, Microsoft Cortana, etc. Thus, translating natural language queries and commands into an executable form, also known as semantic parsing, is one of the prime challenges nowadays in natural language understanding. In this talk I would like to highlight the main challenges and limitations in the field of semantic parsing, and to describe ongoing work that addresses those challenges. First, semantic parsers require information to be stored in a knowledge-base (KB), which substantially limits their coverage and applicability. Conversely, the web has huge coverage, but search engines that access the web do not handle language compositionality well. We propose to treat the web as a KB and compute answers to complex questions in broad domains by decomposing the question into a sequence of simple questions, extracting answers with a search engine, and recomposing the answers to obtain a final result. Second, deploying virtual assistants in many domains (cars, homes, calendar, etc.) requires the ability to quickly develop semantic parsers. However, most past work trains semantic parsers from scratch for any domain, while disregarding training data from other domains. We propose a zero-shot approach for semantic parsing, where we decouple the structure of language from the contents of the domain and learn a domain-independent semantic parser.


Dr. Jonathan Berant has been a senior lecturer in The School of Computer Science since October 2016, working on various natural language understanding problems. Jonathan got his PhD from Tel-Aviv University in 2012 and was a post-doctoral fellow at Stanford’s Computer Science Department from 2012 to 2015. Jonathan was also a post-doctoral fellow at Google Research from 2015 to 2016. Jonathan was an Azrieli fellow and an IBM fellow during his graduate studies, and a Rothschild fellow during his post-doctoral period. His work has been recognized by a best paper award (authored by a student) in ACL 2011, a best paper award in EMNLP 2014, and also best paper nominations in ACL 2013 and ACL 2014. Since being appointed as senior lecturer at Tel-Aviv University he has won grants from the ISF (2016), BSF (2017), and Samsung runway project (2017).

Apr 26th 2018, Thu 10:00 , Idan Schwartz (webpage).

Technion – Israel Institute of Technology (PhD Student).

Location: Gonda Building (901), Room 101.

High-Order Attention Models for Visual Question Answering


The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.

Mar 22nd 2018, Thu 10:00 , Eliya Nachmani.

Facebook AI Research.

Location: Gonda Building (901), Room 101.

Synthesis and Cloning Human Voices


Text-to-speech (TTS) systems transform text into speech. In this talk we present a new neural TTS for voices that are sampled in the wild. We introduce a new network architecture, VoiceLoop, which is simpler than those in the existing literature and is based on a novel shifting buffer working memory. Our solution is able to deal with unconstrained voice samples without requiring aligned phonemes or linguistic features. We also show how we can control the emotion variability in the generated speech by priming the network buffer.

Furthermore, TTS systems have the potential to generalize from one speaker to another given a relatively short sample of any new voice. We present a method that is designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that, given an audio sample, places the speaker in an embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate greatly improved performance both on the dataset speakers and, more importantly, when fitting new voices, even from very short samples.

Mar 15th 2018, Thu 10:00 , Ofra Amir (webpage).

Technion – Israel Institute of Technology (Faculty).

Location: Gonda Building (901), Room 101.

Distilling relevant information to support human-human and human-agent collaboration


One of today's biggest challenges is the heightened complexity and information overload stemming from increasingly interacting systems, consisting of both humans and machines. In this talk, I will describe work that aims to address this challenge in two settings: human teamwork and human-agent collaboration. In the context of human teamwork, I will present our work towards developing intelligent systems that reduce coordination overhead in distributed teams by personalizing the information that is shared with team members. We developed an algorithm that determines what information about others’ activities is most relevant to each team member, and show through a user study that such personalized information sharing resulted in higher productivity and reduced workload of team members, without detrimental effects on the quality of the team’s work.

In the context of human-agent collaboration, I will describe our work towards the development of methods for summarizing agent behavior, with the goal of enabling users to better understand the capabilities of agents they interact with. We developed “HIGHLIGHTS”, an algorithm that extracts important trajectories from the execution trace of an agent to generate a succinct description of key agent behaviors. Our experiments show that study participants were more successful at assessing agents’ capabilities when shown summaries generated by HIGHLIGHTS compared to baseline summaries.

Jan 4th 2018, Thu 16:00 , Yonatan Belinkov (webpage).

Massachusetts Institute of Technology (PhD Student).

Location: Building 211, room 101.

Understanding Internal Representations in Deep Learning Models for Language and Speech Processing


Language technology has become pervasive in everyday life, powering applications like Apple’s Siri or Google’s Assistant. Neural networks are a key component in these systems thanks to their ability to model large amounts of data. Contrary to traditional systems, models based on deep neural networks (a.k.a. deep learning) can be trained in an end-to-end fashion on input-output pairs, such as a sentence in one language and its translation in another language, or a speech utterance and its transcription. The end-to-end training paradigm simplifies the engineering process while giving the model flexibility to optimize for the desired task. This, however, often comes at the expense of model interpretability: understanding the role of different parts of the deep neural network is difficult, and such models are often perceived as “black-box”. In this work, we study deep learning models for two core language technology tasks: machine translation and speech recognition. We advocate an approach that attempts to decode the information encoded in such models while they are being trained. We perform a range of experiments comparing different modules, layers, and representations in the end-to-end models. Our analyses illuminate the inner workings of end-to-end machine translation and speech recognition systems, explain how they capture different language properties, and suggest potential directions for improving them. The methodology is also applicable to other tasks in the language domain and beyond.

Jan 2nd 2018, Tue 14:00, Masashi Sugiyama (website).

RIKEN / The University of Tokyo (Faculty).

Location: Gonda Building (901), Room 101.

Machine Learning from Weak Supervision - Towards Accurate Classification with Low Labeling Costs


Machine learning from big training data is achieving great success. However, there are various application domains that prohibit the use of massive labeled data. In this talk, I will introduce our recent advances in classification from weak supervision, including classification from two sets of unlabeled data, classification from positive and unlabeled data, a novel approach to semi-supervised classification, and classification from complementary labels. Finally, I will briefly introduce the activities of RIKEN Center for Advanced Intelligence Project.


Prof. Sugiyama is the director of the RIKEN Center for Advanced Intelligence Project (AIP) and a professor in the Department of Complexity Science and Engineering at the University of Tokyo.

He was born in Osaka, Japan, in 1974. He received the degrees of Bachelor of Engineering, Master of Engineering, and Doctor of Engineering in Computer Science from Tokyo Institute of Technology, Japan, in 1997, 1999, and 2001, respectively. In 2001, he was appointed Assistant Professor at the same institute, and he was promoted to Associate Professor in 2003. He moved to the University of Tokyo as Professor in 2014. Since 2016, he has concurrently served as Director of the RIKEN Center for Advanced Intelligence Project. He received an Alexander von Humboldt Foundation Research Fellowship and conducted research at the Fraunhofer Institute, Berlin, Germany, from 2003 to 2004. In 2006, he received a European Commission Program Erasmus Mundus Scholarship and conducted research at the University of Edinburgh, Edinburgh, UK. He received the Faculty Award from IBM in 2007 for his contribution to machine learning under non-stationarity, the Nagao Special Researcher Award from the Information Processing Society of Japan in 2011, the Young Scientists' Prize for the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology Japan in 2014 for his contribution to the density-ratio paradigm of machine learning, and the Japan Society for the Promotion of Science Award and the Japan Academy Medal in 2017 for his series of machine learning research. His research interests include theories and algorithms of machine learning and data mining, and a wide range of applications such as signal processing, image processing, and robot control.

Dec 24th 2017, Sun 11:00, Yuval Pinter (webpage).

Georgia Institute of Technology (PhD Student).

Location: Cyber-Center Meeting Room.

Integrating Distributional and Compositional Approaches to Word Embeddings


In recent years, nearly all applications in Natural Language Processing have become dominated by Machine Learning methods that use dense, low-dimensional, word vectors (known as embeddings) as their building blocks, including Machine Translation, Question Answering, Sentiment Analysis, and many more. These word embeddings, particularly suited for use in neural nets, are typically obtained via techniques that stem from a distributional approach, i.e. learning to maximize vector similarity of words that tend to appear in similar contexts within large textual corpora. This powerful approach has its limits, a major one being representation of words not seen in the training corpus, known as the out-of-vocabulary (OOV) problem.

In my talk, I will present some approaches that tackle the OOV problem by complementing the distributional embedding methods with a compositional view of word structure. I will focus on our recent algorithm, MIMICK (EMNLP 2017), which produces OOV embeddings by re-learning distributionally-trained vectors using only the way words are spelled, using a character-level Recurrent Neural Net (RNN). I will show the merits of our model across a diverse array of 23 languages on a sequence-tagging task. I will discuss the implications of our results based on attributes of different languages and datasets, as well as some new findings relating to the architecture choices underlying the MIMICK model.

Dec 21st 2017, Thu 12:00 , Karen Livescu (webpage).

Toyota Technological Institute at Chicago (Faculty).

Location: Building 216, Colloquium room.

How should we use domain knowledge in the era of deep learning? (A perspective from speech processing)


Deep neural networks are the new default machine learning approach in many domains, such as computer vision, speech processing, and natural language processing. Given sufficient data for a target task, end-to-end models can be learned with fairly simple, almost universal algorithms. Such models learn their own internal representations, which in many cases appear to be similar to human-engineered ones. This may lead us to wonder whether domain-specific techniques or domain knowledge are needed at all.

This talk will provide a perspective on these issues from the domain of speech processing. It will discuss when and how domain knowledge can be helpful, and describe two lines of work attempting to take advantage of such knowledge without compromising the benefits of deep learning. The main application will be speech recognition, but the techniques discussed are general.

Oct 25th 2017, Wed, 10:30, Omer Levy (website).

University of Washington (Post Doc).

Location: Gonda Building (901), Room 101.

What does an LSTM Learn?


Long short-term memory (LSTM) was designed to address the problem of vanishing gradients in a simple recurrent neural network (S-RNN) by introducing a memory cell that records information via addition. We observe, on a variety of natural language tasks, that replacing the embedded S-RNN with a simple linear transformation does not degrade performance, implying that the S-RNN's role in an LSTM is redundant. We conjecture that the modeling power of an LSTM stems directly from the memory cell, and examine the value that it stores at each iteration. Our analysis reveals that the memory cell dynamically computes an element-wise weighted sum over its inputs, suggesting that this more restricted function space is the main driving force behind the success of LSTMs.
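
The weighted-sum view of the memory cell can be checked numerically with a short NumPy sketch. This is an illustration rather than the speaker's code. It assumes the standard cell update c_t = f_t * c_{t-1} + i_t * x_t (element-wise), with an initial cell of zeros, and it uses random numbers as stand-ins for the gates and content vectors, which in a real LSTM are computed from the inputs and hidden state. Unrolling the recurrence gives c_T as a sum over time steps j of x_j weighted by i_j times the product of all later forget gates.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                           # sequence length and cell size (arbitrary)
content = rng.normal(size=(T, d))     # stand-ins for the content vectors x_t
i_gate = rng.uniform(size=(T, d))     # stand-ins for input gates, values in [0, 1)
f_gate = rng.uniform(size=(T, d))     # stand-ins for forget gates, values in [0, 1)

# Recurrent form: c_t = f_t * c_{t-1} + i_t * x_t, starting from c = 0.
c = np.zeros(d)
for t in range(T):
    c = f_gate[t] * c + i_gate[t] * content[t]

# Unrolled form: c_T = sum_j w_j * x_j, where w_j = i_j * prod_{k > j} f_k.
weights = np.empty((T, d))
for j in range(T):
    # The product over an empty slice (j = T - 1) is a vector of ones.
    weights[j] = i_gate[j] * np.prod(f_gate[j + 1:], axis=0)
c_unrolled = (weights * content).sum(axis=0)

print(np.allclose(c, c_unrolled))
```

Both forms produce the same final cell, so the cell state is an element-wise weighted sum of the content vectors, with weights set by the gates.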


I am a post-doc in the Department of Computer Science & Engineering at the University of Washington, working with Prof. Luke Zettlemoyer. Previously, I completed my PhD at Bar-Ilan University with the guidance of Prof. Ido Dagan and Dr. Yoav Goldberg. I am interested in designing algorithms that mimic the basic language abilities of humans, and using them to realize semantic applications such as question answering and summarization that help people cope with information overload. I am also interested in deepening our qualitative understanding of how machine learning is applied to language and why it succeeds (or fails), in hope that better understanding will foster better methods.