Mathematics for Big Data

An ECMI Special Interest Group

The availability of huge amounts of data is often regarded as the fourth industrial revolution, which we are living through right now. The increase in data accumulation allows us to tackle a wide range of social, economic, industrial and scientific challenges. But extracting meaningful knowledge from the available data is not a trivial task and represents a severe challenge for data analysts. Mathematics plays an important role in the existing algorithms for data processing through techniques of statistical learning, signal analysis, distributed optimization, compressed sensing, etc.

The amounts of data that are available now, and that will become available in the near future, call for significant efforts in mathematics. These efforts are needed to make the data useful. The main challenges we plan to consider within this SIG are, roughly speaking, in the areas of mathematical optimization and statistics.

Minimizing a cost function based on a large amount of data is a typical problem in all big data areas, from smart agriculture, energy efficiency, computational biology and high-tech industries that rely on simulation in materials design, to social networks, data-driven policy decisions, and risk assessment in finance, security, natural disasters, etc. Mathematically speaking, the challenge in these areas is the design of algorithms that can process huge amounts of data within a reasonable time span and with the computing power that is widely available today. Two important issues are distributed optimization and data privacy. Several EU documents cite data privacy as an important question that is still to be resolved. Distributed optimization, in turn, allows optimization techniques to be run in parallel on several computers connected in networks of different types.
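As a rough illustration of distributed optimization, the sketch below minimizes a least-squares cost whose data are split across several nodes. Each node computes the gradient of its own part of the cost, and only these gradients are combined. The least-squares cost, the function names, the number of nodes and the synthetic data are all assumptions made for this example; the nodes are simulated in a single process, and exchanging gradients does not by itself guarantee privacy.

```python
import numpy as np

def local_gradient(A_i, b_i, x):
    """Gradient of the local cost 0.5 * ||A_i x - b_i||^2 held by one node."""
    return A_i.T @ (A_i @ x - b_i)

def distributed_gradient_descent(blocks, dim, steps=500):
    """Minimize the sum of the local costs; each data block stays on its node."""
    # Step size 1/L, where L is an upper bound on the largest eigenvalue
    # of sum_i A_i^T A_i (the sum of the squared spectral norms).
    L = sum(np.linalg.norm(A_i, 2) ** 2 for A_i, _ in blocks)
    x = np.zeros(dim)
    for _ in range(steps):
        # In a real deployment each node would compute its gradient in parallel;
        # only these small vectors, not the raw data, are sent to be summed.
        grad = sum(local_gradient(A_i, b_i, x) for A_i, b_i in blocks)
        x = x - grad / L
    return x

# Hypothetical synthetic data: four simulated nodes, 50 observations each.
rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0, 0.5])
blocks = []
for _ in range(4):
    A_i = rng.normal(size=(50, 3))
    blocks.append((A_i, A_i @ x_true + 0.01 * rng.normal(size=50)))

x_hat = distributed_gradient_descent(blocks, dim=3)
print(x_hat)
```

Summing the local gradients reproduces the gradient of the full cost, so the iterates coincide with those of ordinary gradient descent on the pooled data; the practical difficulties (communication cost, unreliable links, nodes without a central coordinator) are what make the general problem hard.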

The extraction of meaningful information from data is one of the main tasks of Statistics. In the presence of big data, most of the usual techniques for statistical analysis cannot easily be applied, since they are based on the simultaneous processing of the whole dataset.

In recent years a great effort has been made, mainly by computer scientists, to find fast and scalable procedures, which have become popular with distributed architectures (e.g. the well-known MapReduce paradigm). Unfortunately, in many situations such procedures cannot be applied to solve statistical problems in a distributed way, or they work only under overly restrictive and thus unrealistic conditions. Deeper mathematical insight in this context may help to better understand the theoretical and applied power of the new algorithms and to extend them to more realistic cases.
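To make the MapReduce idea concrete, here is a minimal sketch of a statistic that does decompose over chunks of data: the mean and variance, computed by summarizing each chunk separately and then merging the summaries pairwise. The function names and the toy data are assumptions for this example, and the "nodes" are simply Python lists.

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: summarize one chunk as (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((v - mean) ** 2 for v in chunk)
    return n, mean, m2

def combine(a, b):
    """Reduce step: merge two summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

# Toy data split over three hypothetical nodes.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
n, mean, m2 = reduce(combine, map(map_chunk, chunks))
variance = m2 / n  # population variance of the pooled data
print(n, mean, variance)
```

This works because the triple (count, mean, sum of squared deviations) can be merged exactly. Many statistical quantities have no such small mergeable summary; the exact median, for instance, cannot in general be recovered from the medians of the individual chunks.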

Sometimes data are “big” because of their high dimensionality and space-time structure (think, e.g., of satellite images, signals recorded by sensors or antennas, etc.). In such cases suitable mathematical techniques for dimensionality reduction are needed both for data visualization and for numerical treatment. Functional Statistics, a field in which a lot of research is currently concentrated, may help with this task.

In other contexts data are considered “big” because of their complexity or heterogeneity (e.g. data extracted from social networks with text mining, combined with socioeconomic data for marketing purposes; or highly interrelated data which may be represented by complex graphs, like atoms and bonds in a protein, relationships between users of a social network, etc.). Sentiment analysis and Topological Data Analysis are new statistical fields of research, still under development, which may help to tackle the problem of analyzing such data.

The aim of this Special Interest Group is to bring together people working on the themes described above, coming both from academia and from “industry” (understood in a broad sense), in order to foster scientific collaboration and research by organizing common activities such as:

1. Workshops or minisymposia with presentations of current problems by “Industry” and/or current methods by “Academia”

2. Awareness seminars, study groups, or training for SMEs, with

a. “industry” providing data/proposing test cases

b. academics running test cases with innovative methods

3. Scientific collaboration and joint research projects (possibly also with EU funding) among the members

Coordinators of the SIG: