Currently, the primary focus of my research is information retrieval. Two particular cases that I am focussing on are retrieval of experiments, where an experiment is a collection of covariate-outcome measurement pairs, and retrieval of metagenomic samples, where a sample is a collection of sequencing reads. I have also been developing probabilistic and Bayesian matrix factorization tools, particularly for archetypal analysis.
During my PhD, the primary focus of my research was to develop nonparametric tools to measure (statistical) independence, dependence, conditional independence, and conditional dependence between random variables, and to apply them in independent component analysis, variable selection, causal inference, and network connectivity analysis, such as exploring functional and effective connectivity from EEG recordings. I have also been interested in developing measures of divergence between point processes for hypothesis testing, and in designing strictly positive definite kernels on spike trains for neural decoding. In addition, I have worked on information theoretic learning, adaptive filtering, echo state networks, and compressed sensing.
Fields of interest
- Information retrieval
- Information retrieval deals with finding relevant objects given a query object: a popular example is document retrieval. However, the same principles are applicable to other areas. My research focuses on extending these ideas to experiments and metagenomics. An experiment can be seen as a set of measurements over certain covariates and outcomes. My focus is on designing appropriate similarity measures between two experiments that can be applied in a general setting.
- Metagenomics is the study of genomic data sequenced directly from an environmental sample. This differs from the standard approach of studying a single genome. One of the primary aims of metagenomics is to explore which species are present in the sample. This requires a computationally expensive 'assembly' step, which forms larger genomic fragments from reads so that they can be identified. My research deals with developing computational tools for extracting relevant information from a group of metagenomic samples without assembly.
- Archetypal analysis
- Archetypal analysis is an exploratory tool. Given a set of observations, it finds a few 'ideal' observations, called archetypes, and expresses the rest as convex combinations of the archetypes. In other words, it finds an approximate convex hull of the observations whose vertices are the archetypes. Archetypal analysis differs from the widely used principal component analysis because it expresses observations as convex combinations: it is not necessarily a dimensionality reduction technique, since the number of archetypes can exceed the dimensionality of the observations. The closest relative of archetypal analysis is k-means, which views the samples as variations of certain prototypes. My research focusses on developing a probabilistic framework that allows archetypal analysis of non-real-valued observations, such as nominal data.
- Independence measure
- A measure of independence is a statistic of two random variables that assumes the value zero if and only if the random variables are independent. The primary focus of this research area is to develop estimators that can infer independence as accurately as possible from as few samples as possible. A practical application of such measures is independent component analysis, where they can be used as cost functions to be minimized. However, from an application perspective these measures should also be easy to evaluate and should not involve free parameters. My research focuses on designing measures that are both accurate and efficient in the context of independent component analysis.
- Dependence measure
- Dependence is often understood as the mere absence of independence. However, this is not accurate, since a measure of independence only provides a binary answer, whereas a measure of dependence is expected to quantify how strongly two variables are related. The two most popular measures of dependence are correlation and mutual information. However, correlation only captures linear relationships, whereas mutual information is difficult to estimate in practice. Moreover, although most of the available dependence measures explain what dependence is in the context of two random variables - often following the postulates proposed by Rényi - the corresponding estimators do not provide an intuitive understanding of what makes a set of realizations dependent. My research focuses on developing a new understanding of dependence in the context of realizations in arbitrary metric spaces, and on establishing new estimators of dependence for practical applications such as variable selection and assessing dependence between more exotic signals, such as sets of spike train observations.
- Conditional independence measure
- A measure of conditional independence is a statistic of three random variables that assumes the value zero if and only if the first two random variables are conditionally independent given the third. Detecting conditional independence is a similar but slightly more difficult problem than detecting independence, since it requires estimating the conditional probability law. An application of such measures is Granger non-causal inference. A time series is said not to cause another time series if, given the past values of the latter, its present value is conditionally independent of the past values of the former. However, due to the difficulty of estimating conditional independence, in practice Granger causality only deals with causality in mean. My research focuses on designing measures of conditional independence to infer Granger non-causality by considering the entire probability law and not just the mean.
- Conditional dependence measure
- Similar to the concept of dependence, conditional dependence captures the strength of the relationship between two variables given the value of a third random variable. A typical application of this concept is to quantify causal flow in a network, i.e., how strongly one node causally drives another given the values of the rest of the nodes. The most popular measure of conditional dependence is conditional mutual information. However, it is difficult to estimate. My research focuses on building efficient, parameter-free estimators of conditional dependence to quantify causal flow.
- Divergence measure
- A measure of divergence is a function of two probability laws that is zero if and only if the probability laws are the same. I am interested in estimating the divergence between two point processes, which can be treated as probability laws over the space of spike trains. A typical application of such measures is the detection of non-stationarity. However, the difficulty in constructing such measures is that the space of spike trains lacks Euclidean structure. My research focuses on designing efficient divergence measures that bypass this inherent difficulty.
- Kernel design
- A related problem is designing strictly positive definite kernels on the space of spike trains, since such kernels inherently lead to a measure of divergence; once again, the difficulty is the lack of Euclidean structure of this space. Another application of such kernels is neural decoding, which is essentially a regression problem from spike train observations to the real line and can therefore be solved using kernel ridge regression. However, from an application perspective the kernels need to be simple to evaluate and to act as good similarity measures. My research focuses on designing kernels on spike trains that satisfy these requirements.
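
To make the neural decoding setting concrete, the following is a minimal sketch of kernel ridge regression from spike trains to real-valued targets. The kernel is an illustrative choice and not necessarily one of the kernels developed in this research: it takes a Gaussian of the distance induced by an inner product that sums a Laplacian function over all pairs of spike times. All function names, the parameters `tau`, `sigma`, and `lam`, and the toy data are assumptions made for this example.

```python
import numpy as np

def mci(s, t, tau=0.1):
    """Inner product of two spike trains: a Laplacian function of the
    time difference, summed over all pairs of spike times."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    if s.size == 0 or t.size == 0:
        return 0.0
    return float(np.exp(-np.abs(s[:, None] - t[None, :]) / tau).sum())

def spike_kernel(s, t, tau=0.1, sigma=1.0):
    """Gaussian of the squared distance induced by the inner product above."""
    d2 = mci(s, s, tau) - 2.0 * mci(s, t, tau) + mci(t, t, tau)
    return np.exp(-max(d2, 0.0) / sigma**2)  # clip tiny negative rounding errors

def gram(A, B, **kw):
    """Kernel matrix between two lists of spike trains."""
    return np.array([[spike_kernel(a, b, **kw) for b in B] for a in A])

def fit_krr(trains, y, lam=1e-3, **kw):
    """Kernel ridge regression weights: solve (K + lam * I) alpha = y."""
    K = gram(trains, trains, **kw)
    return np.linalg.solve(K + lam * np.eye(len(trains)), np.asarray(y, float))

def predict_krr(alpha, train_trains, new_trains, **kw):
    """Prediction f(x) = sum_i alpha_i * k(x, x_i) for each new spike train."""
    return gram(new_trains, train_trains, **kw) @ alpha

# Hypothetical toy data: random spike times in [0, 1], target = spike count.
rng = np.random.default_rng(0)
toy_trains = [np.sort(rng.uniform(0, 1, size=n)) for n in (3, 5, 2, 7, 4, 6)]
toy_y = np.array([len(t) for t in toy_trains], float)
toy_alpha = fit_krr(toy_trains, toy_y)
toy_pred = predict_krr(toy_alpha, toy_trains, toy_trains[:2])
```

Because the kernel only needs pairwise differences of spike times, no Euclidean embedding of the spike trains is required. The cost of a single kernel evaluation grows with the product of the two spike counts, which illustrates why kernels that are simple to evaluate matter in practice.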