During the first year of the project, the team will focus on modelling extremal events and detecting anomalies in the univariate case. The methods fall into two main groups.
The first group contains methods for working with unimodal distributions. The tasks related to this group are:
- To develop an accurate model for the tails of the observed distribution. This problem is usually the subject of extreme value theory and relies on the machinery of order statistics.
- To construct an accurate model for the center of the observed distribution and to identify outliers correctly. Events that are expected to occur with probability below some appropriately small threshold can then be considered anomalous. For example, these could be observations from the component H in the mixture G = (1-e)F + eH, where F and H are cumulative distribution functions and e is a small mixing proportion. We are particularly interested in the problem of estimating the parameters of F. This task is the subject of robust statistical data analysis.
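As a rough illustration of the two tasks above, the following sketch pairs a classical order-statistics tail estimator (the Hill estimator) with a median/MAD estimate of the location and scale of F under contamination. It is only a minimal example under stated assumptions: the function names, the choice of a normal F, the number k of upper order statistics, and the cutoff of 3 robust standard deviations are all illustrative and are not prescribed by the project plan.

```python
import math
import random
import statistics


def hill_estimator(sample, k):
    """Hill estimate of the tail index from the k largest order statistics."""
    xs = sorted(sample, reverse=True)      # descending order statistics
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    threshold = xs[k]                      # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the Hill estimator needs a positive threshold")
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def robust_location_scale(sample):
    """Median and rescaled MAD; the factor 1.4826 assumes a normal F."""
    med = statistics.median(sample)
    mad = statistics.median([abs(x - med) for x in sample])
    return med, 1.4826 * mad


def flag_outliers(sample, cutoff=3.0):
    """Mark points lying more than `cutoff` robust scale units from the median."""
    med, scale = robust_location_scale(sample)
    return [abs(x - med) / scale > cutoff for x in sample]


rng = random.Random(0)

# Tail task: simulated Pareto data with (assumed) true tail index 2.
pareto_sample = [rng.paretovariate(2.0) for _ in range(20000)]
print("Hill estimate of the tail index:", hill_estimator(pareto_sample, k=1000))

# Center task: G = (1-e)F + eH with F = N(0, 1), H = N(10, 1), e = 0.05 (assumed).
contaminated = [
    rng.gauss(10.0, 1.0) if rng.random() < 0.05 else rng.gauss(0.0, 1.0)
    for _ in range(5000)
]
print("Sample mean (pulled towards H):", statistics.fmean(contaminated))
print("Robust (location, scale) of F:", robust_location_scale(contaminated))
print("Fraction flagged as anomalous:", sum(flag_outliers(contaminated)) / 5000)
```

The median and MAD are used here only because they stay close to the parameters of F when a small fraction of the data comes from H, whereas the sample mean does not. In practice the choice of k for the Hill estimator is itself a delicate question.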
The second group contains methods for modelling and estimating all parameters of mixed probability distributions. For a mixture of two distributions, the difference from the previous approaches is that here one has to model both F and H. We will consider mainly the cases where both F and H are heavy-tailed, or where F is light-tailed and H is heavy-tailed. At the end of the first year, we will consider techniques for estimating all parameters of mixed probability distributions with more than two components.
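To make the light-tailed/heavy-tailed case concrete, the sketch below fits all parameters of a two-component mixture with an expectation-maximization (EM) iteration. The specific families are assumptions chosen only because their weighted maximum likelihood updates have closed forms: F is taken to be a shifted exponential on [1, ∞) with rate `lam`, and H a Pareto distribution with scale 1 and tail index `alpha`; `eps` plays the role of the mixing proportion. The function names and starting values are hypothetical, and EM is one possible estimation technique, not necessarily the one the project will adopt.

```python
import math
import random


def sample_mixture(n, eps, lam, alpha, rng):
    """Draw n points from (1-eps)*ShiftedExp(lam) + eps*Pareto(alpha) on [1, inf)."""
    return [
        rng.paretovariate(alpha) if rng.random() < eps
        else 1.0 + rng.expovariate(lam)
        for _ in range(n)
    ]


def em_exp_pareto(xs, n_iter=200, eps=0.5, lam=1.0, alpha=1.0):
    """EM for G = (1-eps)*F + eps*H.

    F has density lam * exp(-lam * (x - 1)), H has density alpha * x**(-alpha - 1).
    Returns (eps, lam, alpha, trace), where trace holds the log-likelihood
    evaluated at the start of every iteration.
    """
    trace = []
    n = len(xs)
    for _ in range(n_iter):
        # E-step: posterior probability that each point was drawn from H.
        resp = []
        loglik = 0.0
        for x in xs:
            f = (1.0 - eps) * lam * math.exp(-lam * (x - 1.0))
            h = eps * alpha * x ** (-alpha - 1.0)
            resp.append(h / (f + h))
            loglik += math.log(f + h)
        trace.append(loglik)
        # M-step: weighted closed-form maximum likelihood updates.
        weight_h = sum(resp)
        weight_f = n - weight_h
        eps = weight_h / n
        lam = weight_f / sum((1.0 - r) * (x - 1.0) for r, x in zip(resp, xs))
        alpha = weight_h / sum(r * math.log(x) for r, x in zip(resp, xs))
    return eps, lam, alpha, trace


rng = random.Random(0)
data = sample_mixture(4000, eps=0.3, lam=5.0, alpha=1.5, rng=rng)
eps_hat, lam_hat, alpha_hat, trace = em_exp_pareto(data)
print("estimated mixing proportion:", eps_hat)
print("estimated exponential rate: ", lam_hat)
print("estimated Pareto tail index:", alpha_hat)
```

Unlike the robust approach, where only the parameters of F are estimated, every observation here contributes to both components through its posterior weight, so the heavy-tailed part H is modelled explicitly.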
During the second year, we expect to improve some existing techniques for modelling extremal events and detecting anomalies in the multivariate case. Here, because of possible dependence in the structure of the data, more questions arise. Multivariate outliers behave differently from the majority of observations, which are assumed to follow some underlying model, such as a multivariate normal distribution. We will focus on identifying the most appropriate methods for declustering and dimensionality reduction of the data, and on determining the cases in which those methods are applicable. For the case when the variance may not exist, we will apply techniques recently suggested by Filzmoser and co-authors for sparse and cellwise robust principal component analysis. The authors replace the squared loss function for the approximation error with a robust version. They also integrate a sparsity-inducing L1 or elastic net penalty, which offers additional modelling flexibility. To solve the resulting optimization problem, they develop an algorithm based on Riemannian stochastic gradient descent. The main advantage of this algorithm is that it scales to high-dimensional data, in both the number of variables and the number of observations. They call the resulting method SCRAMBLE (Sparse Cellwise Robust Algorithm for Manifold-based Learning and Estimation).
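For orientation only, an objective of the general type described above can be written schematically as follows. This is not the authors' exact formulation: the symbols (data cells x_{ij}, loading matrix V, low-rank reconstruction \hat{x}_{ij}(V), robust loss \rho, penalty P, tuning parameters \lambda and \gamma, manifold \mathcal{M}) are illustrative notation introduced here, and the Huber function is shown merely as one familiar example of a robust loss.

```latex
\min_{V \in \mathcal{M}} \;
  \sum_{i=1}^{n} \sum_{j=1}^{p} \rho\bigl( x_{ij} - \hat{x}_{ij}(V) \bigr)
  \; + \; \lambda \, P(V),
\qquad
P(V) =
\begin{cases}
  \lVert V \rVert_{1} & \text{(L1 penalty)} \\[4pt]
  \gamma \lVert V \rVert_{1} + \dfrac{1-\gamma}{2} \lVert V \rVert_{F}^{2} & \text{(elastic net penalty)}
\end{cases}

% One example of a robust loss (Huber, with tuning constant c > 0):
\rho_{c}(r) =
\begin{cases}
  \tfrac{1}{2} r^{2} & \text{if } \lvert r \rvert \le c, \\[4pt]
  c \lvert r \rvert - \tfrac{1}{2} c^{2} & \text{if } \lvert r \rvert > c.
\end{cases}
```

Because the loss is summed over individual cells x_{ij} rather than over whole rows, a single aberrant cell has only a bounded influence on the fit, which is the sense in which such a method is cellwise robust.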
Throughout the project, we will demonstrate the usefulness of the developed algorithms for anomaly detection in ecology and management.