The work of Ho and Basu (2002) was seminal in analyzing the difficulty of a classification problem through descriptors, known as complexity measures, extracted from a training dataset. One of the main purposes of these complexity measures (or indexes) is to characterize the intrinsic difficulty of the classification problem represented by a given dataset. They capture, for example, statistics on the geometry and topology of the data and on the shape of the classification boundaries (class separation).
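As an illustration of what such an index looks like in practice, the sketch below computes the maximum Fisher's discriminant ratio, one of the feature-overlap measures commonly attributed to Ho and Basu (2002). For each feature it compares the squared distance between the two class means with the sum of the class variances, then keeps the best feature; higher values suggest an easier problem. The function name, the toy data and the handling of zero-variance features are assumptions made for this example, and the sketch covers two-class problems only.

```python
import numpy as np


def fisher_discriminant_ratio(X, y):
    """Maximum per-feature Fisher's discriminant ratio for a two-class problem.

    Hypothetical helper written for illustration; not taken from the source.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles two-class problems only")

    a = X[y == classes[0]]
    b = X[y == classes[1]]

    # Squared difference of class means, per feature.
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    # Sum of (population) class variances, per feature.
    den = a.var(axis=0) + b.var(axis=0)

    # Assumption: a feature with zero variance in both classes counts as
    # perfectly discriminative (inf) if the means differ, and as useless (0)
    # if they coincide.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = np.where(den > 0, num / den, np.where(num > 0, np.inf, 0.0))

    return float(ratios.max())


# Toy data: the first feature separates the classes well, the second does not.
X_toy = [[0, 0], [2, 1], [10, 0], [12, 1]]
y_toy = [0, 0, 1, 1]
f1 = fisher_discriminant_ratio(X_toy, y_toy)
```

Taking the maximum over features reflects the idea that a single highly discriminative feature is enough to make a problem easy along that dimension, even when the remaining features are uninformative.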
In this context, in previous work, we analyzed the complexity of classifying cancer gene expression data. This kind of data often presents characteristics that can negatively affect the generalization ability of the resulting classifiers, notably data sparsity and an unbalanced class distribution. To take these properties into account, we proposed new complexity measures.
Currently, we are extending these measures to classification problems with unbalanced classes, and we are working on complexity indexes for clustering problems. Our research team is also analyzing the use of these measures in active learning and clustering.