For me, the peaks of scientific work are the moments when a simple governing rule is discovered that explains the behavior of a complex system with non-trivial interactions. This thread is woven through the following research projects.
[Figure: Adjacency matrix of a social network with two communities]
Computational-statistical trade-offs
Many of the computational problems encountered in recent years have a high-dimensional flavor, be it partitioning an online social network into sub-communities or computing PCA in an ill-posed biological setting with few samples and many variables. The life-belt in those situations comes in the form of a latent low-dimensional ground truth that governs the high-dimensional distribution. The input is typically given in matrix form (an adjacency matrix in the partitioning problem, a covariance matrix in PCA). The questions that interest me revolve around the computational-statistical trade-offs: How much computational effort is needed to uncover the latent low-dimensional structure as the signal-to-noise ratio (SNR) changes? Which algorithmic approaches are useful in different ranges of SNR? What insights carry over from theoretical models (say, random matrices) to real-world ones?
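As a toy illustration of how a latent low-dimensional structure can be pulled out of a noisy matrix (the model, parameters, and code below are for exposition only, not taken from a specific paper), here is a minimal Python sketch: two planted communities in a stochastic-block-model graph are recovered from the second-leading eigenvector of the adjacency matrix, provided the SNR is high enough.

```python
# Minimal sketch: recovering two planted communities from a noisy adjacency matrix.
# The graph is drawn from a stochastic block model; all parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p_in, p_out = 200, 0.30, 0.10           # community size and edge probabilities
labels = np.repeat([1, -1], n)             # ground-truth community labels

# Build a symmetric adjacency matrix: edges are denser inside communities.
probs = np.where(np.outer(labels, labels) > 0, p_in, p_out)
A = (rng.random((2 * n, 2 * n)) < probs).astype(float)
A = np.triu(A, 1)
A = A + A.T

# The signal lives in a rank-2 structure; the second-largest eigenvector of A
# correlates with the community labels when the SNR is large enough.
eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues sorted ascending
v2 = eigvecs[:, -2]
estimate = np.sign(v2)

accuracy = max(np.mean(estimate == labels), np.mean(estimate == -labels))
print(f"fraction of correctly labeled vertices: {accuracy:.2f}")
```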
For more reading see this paper or this one.
This work is supported by an Israeli Science Foundation (ISF) grant.
Applications of the magical word2vec
Word2vec is a neural-network-based algorithm that generates real-valued vector embeddings of natural-language words in a low-dimensional space. Rather surprisingly, such a low-dimensional embedding has a useful property: words that share a similar context in the language typically have high cosine similarity between their vectors. This property is widely used in many Natural Language Processing applications. Even more surprising (to me), a similar property explains various computational-statistical trade-offs of the previous project. For example, the “words” would be the vertices of the graph, the “context” would be their community and graph neighborhood, and a vector embedding obtained, for example, by Semi-Definite Programming would have the desired cosine-similarity property (for a certain range of problem parameters).
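To make the cosine-similarity property concrete, here is a small sketch; the vectors are made up for illustration and stand in for trained word2vec vectors (or for a graph embedding of vertices), they are not output of any particular model.

```python
# Toy sketch of the cosine-similarity property of low-dimensional embeddings.
# The vectors below are invented; in practice they would come from word2vec
# (or from an SDP-based embedding of graph vertices).
import numpy as np

embeddings = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.1]),   # shares contexts with "cat"
    "stock": np.array([0.1, 0.9, 0.3]),   # appears in very different contexts
}

def cosine(u, v):
    """Cosine similarity: ~1 for nearly parallel directions, ~0 for unrelated ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cat"], embeddings["dog"]))    # high
print(cosine(embeddings["cat"], embeddings["stock"]))  # low
```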
We have used word2vec to develop a promising heuristic for estimating the quality of automatic machine translation. Our work differs from all existing work in the area in that it does not use any information about the distribution of the text being evaluated, nor does it perform any pre-training steps that assume access to similar texts or to expensive training data containing human-scored examples.
In another project, joint with the Visual Perception Lab at BGU, we are studying various aspects of the way the human brain perceives facial expressions. We are developing Face2Vec, a computational framework that learns an assignment of word vectors to facial expressions, allowing a computer to perform tasks such as clustering, classification, and search in a way that was not possible before.
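As a hypothetical illustration of what such an embedding enables (the expression names and vectors below are invented, not Face2Vec output), once facial expressions live in a vector space, generic tools such as k-means clustering apply directly:

```python
# Hypothetical sketch: clustering facial-expression embeddings.
# The 2-d vectors are made up; in Face2Vec they would be learned word vectors
# assigned to facial expressions.
import numpy as np
from sklearn.cluster import KMeans

expressions = ["smile", "laugh", "grin", "frown", "scowl", "pout"]
vectors = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # "positive" expressions
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],   # "negative" expressions
])

# With a meaningful embedding, off-the-shelf clustering recovers
# semantically coherent groups of expressions.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for name, c in zip(expressions, clusters):
    print(name, "-> cluster", c)
```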
For more reading see this paper.
User characterization in online social networks
It is easy to accumulate large amounts of unlabeled data about users in online social networks (OSNs). This data contains various features computed from the content that a user shares on their profile. It is harder to understand what latent signals are found in this data and for which tasks they are useful. Researchers have used PCA to study unlabeled OSN data, across a variety of networks, and found that users can be characterized in a meaningful manner along two main axes: popularity and activity.
These findings raise various interesting research questions. The axes are orthogonal and hence uncorrelated (due to algebraic properties of PCA); this is a rather surprising finding, since it means a user's popularity is uncorrelated with their activity. Is there a hidden Simpson-type paradox at play? Moreover, the way the axes are derived from the PCA solution is rather ad hoc and non-systematic. How can we measure the extent to which the semantic interpretation of the PCs is consistent with reality? And how robust are the axes to the way the data was collected?
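On the first point, the uncorrelatedness is indeed forced by the algebra: PCA scores have zero sample correlation by construction. A minimal sketch on synthetic user features (the features and numbers are invented for illustration, not real OSN data):

```python
# Sketch: PCA scores are uncorrelated by construction, so whatever semantics
# we attach to the two leading PCs (e.g., "popularity" and "activity") are
# necessarily uncorrelated in the sample. Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(1)
# Rows = users, columns = features such as followers, posts, likes, comments.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 4))

Xc = X - X.mean(axis=0)                 # center the features
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # projections onto the principal axes

# The off-diagonal entry is ~0: the two leading PC scores are uncorrelated.
print(np.round(np.corrcoef(scores[:, 0], scores[:, 1]), 3))
```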
For more reading see this paper, this one, or this one.
Machine-Learning assisted drug design
Together with Barak Akabayov from the Chemistry department at BGU, we are working on machine-learning assisted drug design, which is at the forefront of global drug-design research. Concretely, we are working on identifying structures that can serve as good inhibitors and, as a result, be used as a new kind of antibiotic. We are using publicly available big data and self-generated data to train machine-learning models that predict which chemical structures will be most useful, and where.
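To give a flavor of this kind of pipeline (a generic sketch, not the project's actual models, features, or data), candidate molecules can be featurized as fingerprint-style bit vectors and fed to an off-the-shelf classifier that flags likely inhibitors:

```python
# Generic sketch of machine-learning assisted screening (not the project's actual
# pipeline or data): random bit vectors stand in for molecular fingerprints, and
# a random forest predicts "inhibitor" vs. "non-inhibitor".
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 128))         # placeholder fingerprint bits
y = (X[:, :8].sum(axis=1) > 4).astype(int)       # placeholder activity labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", round(model.score(X_te, y_te), 2))
```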
For more reading see this paper.
Smart Transportation
Together with Eran Ben-Elia from the Department of Geography and Environmental Development at BGU, we are leading a project that aims to compare objective measures of public-transportation quality (e.g., GPS-derived adherence to timetables) with subjective ones (say, tweets complaining about a delayed bus or a crowded train). Our goal is to generate an overlaid map that gives decision makers an informative and powerful tool for improving public transportation.
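A small sketch of the "objective" side of the comparison (the column names and numbers are illustrative, not the project's data): given scheduled and GPS-observed arrival times, average delay per stop is one layer that can be overlaid on the map, with tweet-derived sentiment forming the subjective layer.

```python
# Illustrative sketch: computing an objective quality measure (average delay per
# stop) from scheduled vs. GPS-observed arrival times. Data are made up.
import pandas as pd

arrivals = pd.DataFrame({
    "stop_id":   ["A", "A", "B", "B"],
    "scheduled": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:30",
                                 "2024-01-01 08:10", "2024-01-01 08:40"]),
    "observed":  pd.to_datetime(["2024-01-01 08:04", "2024-01-01 08:31",
                                 "2024-01-01 08:25", "2024-01-01 08:55"]),
})

arrivals["delay_min"] = (arrivals["observed"] - arrivals["scheduled"]).dt.total_seconds() / 60
delay_per_stop = arrivals.groupby("stop_id")["delay_min"].mean()
print(delay_per_stop)   # one layer of the overlaid map
```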
This project is funded by an internal BGU grant from the Smart Transportation Center at BGU. For more reading click here.