Current Projects

Scalable Descriptive Models for Large Volumes of Distributed Data

Supervisor: Murilo Coelho Naldi
Funding agency: São Paulo Research Foundation (FAPESP)

Abstract:

The increasing amount of data generated by current technologies makes its analysis challenging. First, much of the data is unlabeled at creation time, so the relationships among its objects are not explicit. Second, the methods used in the analysis must be scalable, achieving their objectives even as the amount of data analyzed grows. With these issues in mind, data clustering comprises unsupervised techniques that categorize data automatically, making it well suited to the analysis of such data. Through these techniques, it is possible to obtain a descriptive analysis of the data from information implicit in their relationships and in the structures they form. However, traditional clustering techniques were developed for small, static data sets. Their limitations often prevent scalability, that is, application to massive or distributed data sets, or to data sets that grow continually. This project aims to study clustering techniques applicable to incremental data sets. We intend to achieve this objective through two research fronts: the first consists of adapting algorithms to scalable programming models that support divide-and-conquer strategies for data access and distribution; the second consists of studying clustering algorithms that generate a model and allow it to be adapted as the data set grows, that is, as data are presented continuously to the algorithm.
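
To illustrate the first front, below is a minimal sketch, not the project's actual implementation, of one k-means iteration decomposed by divide and conquer: each data partition computes partial cluster statistics in a map step, and a reduce step merges them to recompute the centroids. Python's multiprocessing stands in for a distributed runtime such as MapReduce or Spark, and the data, number of partitions, and parameter values are all illustrative assumptions.

```python
# Sketch: one divide-and-conquer k-means iteration as map/reduce over partitions.
# Illustrative only; multiprocessing stands in for a distributed runtime.
from functools import partial
from multiprocessing import Pool
import random

def partial_stats(chunk, centroids):
    """Map step: per-cluster point sums and counts for one data partition."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        j = min(range(k), key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(point, centroids[c])))
        counts[j] += 1
        for d, value in enumerate(point):
            sums[j][d] += value
    return sums, counts

def reduce_stats(results, centroids):
    """Reduce step: merge the partial statistics and recompute the centroids."""
    k, dim = len(centroids), len(centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in results:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # Keep the old centroid if a cluster received no points this iteration.
    return [tuple(s / c for s in total_sums[j]) if (c := total_counts[j]) else centroids[j]
            for j in range(k)]

if __name__ == "__main__":
    # Illustrative data: two Gaussian blobs, split into 4 partitions.
    data = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (5, 5)] for _ in range(500)]
    centroids = random.sample(data, 2)
    chunks = [data[i::4] for i in range(4)]
    for _ in range(10):
        with Pool(4) as pool:
            results = pool.map(partial(partial_stats, centroids=centroids), chunks)
        centroids = reduce_stats(results, centroids)
    print(centroids)
```

Because the map step touches each partition independently, the same decomposition applies whether the partitions live on one machine's cores or on the nodes of a cluster; only the merged statistics, not the raw data, cross partition boundaries.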
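For the second front, a minimal sketch of an online (sequential) k-means in which the model is adapted one point at a time as the data set grows, rather than refit from scratch. The class, its interface, and the simulated stream are illustrative assumptions, not the project's method; the 1/n learning rate simply keeps each centroid at the running mean of its assigned points.

```python
# Sketch: incremental clustering, where the model adapts as each point arrives.
import random

class OnlineKMeans:
    """Clustering model updated incrementally, one point at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(initial_centroids)

    def update(self, point):
        # Assign the arriving point to its nearest centroid...
        j = min(range(len(self.centroids)),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(point, self.centroids[c])))
        # ...then move that centroid toward the point with step 1/count,
        # the running-mean update for that cluster.
        self.counts[j] += 1
        eta = 1.0 / self.counts[j]
        for d, value in enumerate(point):
            self.centroids[j][d] += eta * (value - self.centroids[j][d])
        return j

if __name__ == "__main__":
    model = OnlineKMeans([(0.0, 0.0), (5.0, 5.0)])
    # Simulate a continually growing data set: points presented one by one.
    for cx, cy in random.choices([(0, 0), (5, 5)], k=2000):
        model.update((random.gauss(cx, 0.3), random.gauss(cy, 0.3)))
    print(model.centroids)
```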