The small sample size (or "large-p, small-n") problem is a perennial issue in the world of Big Data. A frequent occurrence in medical imaging, computer vision, omics, and bioinformatics, it describes the situation where the number of features p, in the tens of thousands or more, far exceeds the sample size n, usually in the tens. Data mining, statistical parameter estimation, and predictive modelling are all particularly challenging in such a setting.
Moreover, in all fields where the large-p, small-n problem is a sensitive issue (and indeed in many others), current technology is moving towards higher resolution in sensing and recording while, in practice, sample size is often bounded by hard limits or cost constraints. Meanwhile, even modest improvements in performance when modelling these information-rich, complex data promise significant cost savings or advances in knowledge.
On the other hand, it is becoming clear that "large-p, small-n" is too broad a categorization for these problems, and progress is still possible in the small sample setting either (1) in the presence of side information - such as related unlabelled data (semi-supervised learning), related learning tasks (transfer learning), or informative priors (domain knowledge) - to further constrain the problem, or (2) provided that the data have low complexity, in some problem-specific sense, that we are able to take advantage of. Concrete examples of such low complexity include: a large margin between classes (classification), a sparse representation of the data in some known linear basis (compressed sensing), a sparse weight vector (regression), or a sparse correlation structure (parameter estimation). However, we do not know what other properties of data, if any, make them "easy" or "hard" to work with in terms of the sample size required for a specific class of problems. For example, anti-learnable datasets in genomics come from the same domain as many eminently learnable datasets. Is anti-learnability then just a problem of data quality, the result of an unlucky draw of a small sample, or is there something deeper that makes such data inherently difficult to work with compared to other, apparently similar, data?
This workshop will bring together researchers from industry, academia, and other private and public institutions who work on different kinds of challenges whose common thread is the small sample size problem.
It will provide a forum for exchanging theoretical and empirical knowledge of small sample problems, and for sharing insight into which data structures facilitate progress on particular families of problems - even with a small sample size - which hinder it, and when such structures break down.
A further specific goal of this workshop is to make a start on building links between the many disparate fields working with small data samples, with the ultimate aim of creating a multi-disciplinary research network devoted to this common issue.