Complexity measures

Complexity measures for classification/clustering problems

The work of Ho and Basu (2002) was seminal in analyzing the difficulty of a classification problem through descriptors, called complexity measures, extracted from a training dataset. One of the main purposes of these complexity measures (indexes) is to characterize the intrinsic difficulty of the classification problem represented by a given dataset. The indexes capture, for example, statistics describing the geometry and topology of the data and the shape of the classification boundaries (class separation).
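As an illustration of a class-separation index, the sketch below computes the maximum Fisher's discriminant ratio over features, a measure described by Ho and Basu (2002), for a two-class dataset. The function name, the toy data, and the use of population variance are assumptions made for this example; it is a minimal sketch, not a reference implementation.

```python
from statistics import mean, pvariance


def fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio over features (two classes only).

    For each feature: (mean_a - mean_b)^2 / (var_a + var_b).
    Larger values suggest that at least one feature separates the classes well.
    Population variance is an assumption of this sketch.
    """
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a = [row for row, label in zip(X, y) if label == labels[0]]
    b = [row for row, label in zip(X, y) if label == labels[1]]

    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row in a]
        col_b = [row[j] for row in b]
        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            # Constant feature within both classes: perfectly separating
            # if the class means differ, uninformative otherwise.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0],
     [7.0, 2.0], [8.0, 5.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
score = fisher_ratio(X, y)
print(score)
```

On this toy data the score is driven entirely by feature 0, whose class means are far apart relative to the within-class spread; feature 1 has identical class means and contributes a ratio of zero.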

In this context, in previous work, we analyzed the complexity of classifying cancer gene expression data. Such data often present characteristics that can harm the generalization ability of the induced classifiers, notably data sparsity and an unbalanced class distribution. To take these properties into account, we proposed new complexity measures.
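The two properties mentioned above can be quantified with very simple descriptors. The sketch below computes an imbalance ratio (majority class size over minority class size) and the average number of samples per feature, a sparsity indicator in the spirit of Ho and Basu (2002). These are generic descriptors chosen for illustration, not the measures proposed in the work cited here; the function names and the toy data are hypothetical.

```python
from collections import Counter


def imbalance_ratio(y):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


def samples_per_feature(X):
    """Average number of samples per feature; small values indicate sparse data."""
    return len(X) / len(X[0])


# Hypothetical toy data mimicking gene expression: few samples, many features.
n_features = 12
X = [[0.0] * n_features for _ in range(6)]
y = ["tumor", "tumor", "tumor", "tumor", "normal", "normal"]

ir = imbalance_ratio(y)
spf = samples_per_feature(X)
print(ir, spf)
```

An imbalance ratio well above 1 together with fewer than one sample per feature is the regime in which classifiers tend to generalize poorly.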

Currently, we are extending these measures to classification problems with unbalanced classes. We are also working on complexity indexes for clustering problems, and our research team is analyzing the use of these measures in active learning and clustering.

Main publications:

  • Ana Carolina Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. de Souto, and Tin Kam Ho. How complex is your classification problem? A survey on measuring classification complexity. CoRR, abs/1808.03591, 2018.
  • Victor H. Barella, Luís P. F. Garcia, Marcilio C. P. de Souto, Ana Carolina Lorena, and André de Carvalho. Data complexity measures for imbalanced classification tasks. In IJCNN, pages 1–8, 2018. doi: 10.1109/IJCNN.2018.8489661
  • Ana C. Lorena and Marcílio C. P. de Souto. On measuring the complexity of classification problems. In ICONIP, volume 9489 of LNCS, pages 158–167, 2015. doi: 10.1007/978-3-319-26532-2_18
  • Ana Lorena, Ivan Costa, Newton Spolaôr, and Marcílio C. P. de Souto. Analysis of complexity indices for classification problems: cancer gene expression data. Neurocomputing, 75(1):33–42, 2012. doi: 10.1016/j.neucom.2011.03.054