New! Presentation slides and other material added! See Abstracts.

Paper submission deadline: Sun, July 17, 2016 (extended; was Mon, July 4, 2016)
Paper acceptance notification: Sun, Aug 14, 2016 (was Mon, July 25, 2016)
Paper camera-ready deadline: Fri, Sep 2, 2016 (was Mon, August 8, 2016)
Problem submission: before Mon, September 12, 2016
Workshop date: Mon, September 19, 2016

The 2nd ECML/PKDD 2016 workshop on

Statistically Sound Data Mining

Motivation and objectives

Following on from the previous successful ECML/PKDD workshop SSDM'14 we will again bring together researchers in this significant and topical field.

The field of statistics has developed sophisticated, well-founded methods for inference from data. While some of these impose computational or practical limits that make them infeasible to apply directly to many data mining problems, the field of data mining has much to gain from a more sophisticated understanding of the strengths and limitations of these techniques and from greater use of them where they are appropriate.

In response to this dilemma, a clear trend towards statistically sound data mining is emerging. The main impetus for this new trend is coming from a third party, the application fields. In the computerized world, it is easy to collect large data sets, but analyzing them is more difficult. Knowing the traditional statistical tests is no longer sufficient for scientists, because one must first find the most promising hidden patterns and models to be tested. This means that there is an urgent need for efficient data mining algorithms that can find the desired patterns without missing any significant discoveries or producing too many spurious ones. A related problem is to find a statistically justified compromise between underfitted (too generic to capture all important aspects) and overfitted (too specific, holding just by chance) patterns. However, before any algorithms can be designed, one must first solve many fundamental problems, such as how to define the statistical significance of the desired patterns, how to evaluate overfitting, and how to interpret p-values when multiple patterns are tested. In addition, one should evaluate the existing data mining methods, alternative algorithms and goodness measures to see which of them produce statistically valid results.

As we can see, there are many important problems that should be worked on together by people from data mining, machine learning, and statistics as well as the application fields. The goal of this workshop is to offer a meeting point for this discussion. We want to bring together people from different backgrounds and schools of science, both theoretically and practically oriented, to specify problems, share solutions and brainstorm new ideas.

To encourage real workshopping of actual problems, the workshop is arranged in a novel way, containing an invited lecture and inspiring group work in addition to traditional presentations. This means that non-author participants can also contribute to the workshop results. If you have relevant problems that you would like to have worked on together in the workshop, please send them before the workshop.

Topics of Interest

Topics of interest include but are not limited to:

  • Useful and relevant theoretical results
  • Search methods for statistically valid patterns and models
  • Statistical validation of discovered patterns
  • Evaluating statistical significance of clustering
  • Statistical techniques for avoiding overfitted patterns
  • Scaling statistical techniques to high-dimensionality and high data quantity, covering both theoretical problems (like multiple testing problem) and computational problems (calculating required test measures efficiently)
  • Interesting applications with real world data demonstrating statistically sound data mining
  • Empirical comparisons between different statistical validation methods and possibly other goodness measures
  • Insightful position papers

We particularly encourage submissions which compare different schools of statistics, like frequentist (Neyman-Pearsonian or Fisherian) vs. Bayesian, or analytic vs. empirical significance testing. Equally interesting are submissions introducing generic school-independent computational methods. You can also submit papers describing works-in-progress.

Organization

Workshop Chairs

  • Wilhelmiina Hämäläinen, Academy of Finland/Department of Computer Science, Aalto University, Finland.
    firstname.lastname@gmail.com (replace 'ä', i.e. 'a' with two dots, with 'a')
  • Geoff Webb, Faculty of Information Technology, Monash University, Australia.
    firstname.lastname@monash.edu

Programme Committee


Peter Flach, University of Bristol, UK
Wilhelmiina Hämäläinen, Aalto University, Finland
Florian Lemmerich, University of Würzburg, Germany
Cecile Low-Kam,  Montreal Heart Institute, Canada
Siegfried Nijssen,  Leiden University, Netherlands
Francois Petitjean, Monash University, Australia
Chedy Raissi, INRIA, France
Jan Ramon, INRIA, France
Jun Sese, AIST, CBRC, Japan
Koji Tsuda, University of Tokyo/AIST, Japan
Geoff Webb, Monash University, Australia

Important Dates

Paper submission deadline: Sun July 17, 2016 (extended, was Mon July 4)
Paper acceptance notification: Sun August 14, 2016 (was Mon July 25, 2016)
Paper camera-ready deadline: Fri, September 2, 2016 (was Mon August 8, 2016)
Problem submission: Mon September 12, 2016 (preferably earlier)
Workshop date: Monday, September 19, 2016

Paper Submission

The papers can be either regular papers (recommended maximum length 12 pages in the LNCS format) or short papers (6 pages). These page limits are somewhat flexible.

All papers will be peer-reviewed by 2-3 reviewers. The accepted papers will be presented at the workshop and included in the workshop proceedings. The proceedings will be published in the JMLR: Workshop and Conference Proceedings series after the conference.

Submit your paper as a PDF via the EasyChair SSDM'16 submission page.

If you have good problem ideas for group work, you can send them directly to Wilhelmiina Hämäläinen by email.

Program

The workshop is organized in a non-traditional way, with four invited talks and a discussion of open problems.

A preliminary schedule (Monday, September 19)

09:00 - 10:40 session I (100 min)
 * 9:00 Opening
 * 9:15-10:15 Koji Tsuda: Significant Pattern Mining: Efficient Algorithms and Biomedical Applications
 * 10:15 Matthijs van Leeuwen: Expect the unexpected - On the significance of subgroups
10:40 - 11:00 coffee break (20 min)
11:00 - 12:40 session II (100 min)
 * 11:00 Francois Petitjean: Scaling log-linear analysis to datasets with thousands of variables
 * 11:40 Jan Ramon: Statistically sound analysis of populations resulting from haplotype evolution
 * 12:10 Closing discussion with open problems
12:40 Lunch break

Abstracts and other material

Koji Tsuda: Significant Pattern Mining: Efficient Algorithms and Biomedical Applications.
Pattern mining techniques such as itemset mining, sequence mining and
graph mining have been applied to a wide range of datasets. To
convince biomedical researchers, however, it is necessary to show
statistical significance of obtained patterns to prove that the
patterns are not likely to emerge from random data. The key concept of
significance testing is the family-wise error rate (FWER), i.e., the probability
that at least one pattern is falsely discovered under the null hypotheses.
In the worst case, FWER grows linearly with the number of all possible
patterns. We show that, in reality, FWER grows much slower than the
worst case, and it is possible to find significant patterns in
biomedical data. The following two properties are exploited to
accurately bound FWER and compute small p-value correction factors. 1)
Only closed patterns need to be counted. 2) Patterns of low support
can be ignored, where the support threshold depends on the Tarone
bound. We introduce efficient depth-first search algorithms for
discovering all significant patterns and discuss parallel
implementations.

presentation: http://www.cs.hut.fi/~hamalaw2/kojitsuda_lamp.pdf
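
The second property in the abstract above (low-support patterns can be ignored) can be illustrated with a minimal sketch. It is not the speaker's algorithm: it assumes a one-sided Fisher's exact test (the abstract does not name the test), takes the pattern supports as a ready-made list instead of searching for closed patterns, and uses hypothetical function names. It only shows how a Tarone-style correction factor can be far smaller than the Bonferroni factor.

```python
from math import comb


def min_attainable_pvalue(support, n, n_pos):
    """Smallest p-value a one-sided Fisher's exact test can reach for a
    pattern occurring in `support` of `n` samples, `n_pos` of them positive.
    The extreme table puts as many occurrences as possible in the positives."""
    a_max = min(support, n_pos)
    return comb(n_pos, a_max) * comb(n - n_pos, support - a_max) / comb(n, support)


def tarone_factor(supports, n, n_pos, alpha=0.05):
    """Smallest k such that at most k patterns are 'testable' at level alpha/k,
    i.e. have a minimum attainable p-value not exceeding alpha/k."""
    p_mins = [min_attainable_pvalue(s, n, n_pos) for s in supports]
    for k in range(1, len(p_mins) + 1):
        testable = sum(1 for p in p_mins if p <= alpha / k)
        if testable <= k:
            return k
    return max(len(p_mins), 1)  # only reached for an empty pattern list


# Hypothetical toy data: 20 samples, 10 positives, 10 candidate patterns.
supports = [1, 1, 2, 2, 3, 3, 4, 6, 8, 10]
k = tarone_factor(supports, n=20, n_pos=10)
print("Bonferroni factor:", len(supports), "Tarone-style factor:", k)
```

Patterns with low support can never reach a small p-value, so they do not need to be counted in the correction factor; the significance threshold becomes alpha/k instead of alpha divided by the number of all patterns.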


Matthijs van Leeuwen: Expect the unexpected - On the significance of subgroups.
Within the field of exploratory data mining, subgroup discovery is
concerned with finding regions in the data that stand out with respect
to a particular target. An important question is how to validate the
patterns found; how do we distinguish a true finding from a false
discovery? A common solution is to apply a statistical significance
test that states that a pattern is real iff it is different from a
random subset.  In this paper we argue and empirically show that this
assumption is often too weak, as almost any realistic pattern language
specifies a set of subsets that strongly deviates from random
subsets. In particular, our analysis shows that one should expect the
unexpected in subgroup discovery: given a dataset and corresponding
description language, it is very likely that high-quality subgroups
can —and hence will— be found.

Paper (to appear at Discovery Science 2016):
http://patternsthatmatter.org/pubs/2016/expect_the_unexpected_significance_of_subgroups-vanleeuwen,ukkonen.pdf
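
The "common solution" mentioned in the abstract above, comparing a subgroup against random subsets, can be sketched as follows. This is a generic empirical test written for illustration, not the procedure from the paper: the quality measure (mean of the target), the function name and the toy data are all assumptions.

```python
import random


def random_subset_pvalue(target, subgroup_idx, n_draws=999, seed=0):
    """Empirical p-value: share of random subsets of the same size whose
    mean target value is at least as high as that of the subgroup."""
    rng = random.Random(seed)
    size = len(subgroup_idx)
    observed = sum(target[i] for i in subgroup_idx) / size
    at_least_as_good = 0
    for _ in range(n_draws):
        sample = rng.sample(range(len(target)), size)
        if sum(target[i] for i in sample) / size >= observed:
            at_least_as_good += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (1 + at_least_as_good) / (1 + n_draws)


# Hypothetical toy data: target values 0..99; the subgroup holds the top 10.
target = list(range(100))
subgroup = list(range(90, 100))
print(random_subset_pvalue(target, subgroup))
```

A subgroup found by searching a description language for high quality will usually look extreme against random subsets, which is the reason the abstract calls this null hypothesis too weak.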

Francois Petitjean: Scaling log-linear analysis to datasets with thousands of variables.
Association discovery is a fundamental data mining task.
The primary statistical approach to association discovery between
variables is log-linear analysis. Classical approaches to log-linear
analysis do not scale beyond a dozen variables.
I will explain how, drawing from research in statistics, machine
learning, graph theory and data mining, we have developed methods to
scale log-linear analysis to datasets with thousands of variables on a
standard desktop computer. Our solution, 'Chordalysis', combines chordal
graphs, junction trees and advanced data structures borrowed from
frequent pattern mining. It makes it possible to model datasets with thousands of
variables in seconds without sacrificing the statistical soundness of
the process.

Jan Ramon: Statistically sound analysis of populations resulting from haplotype evolution

Material coming.