# Statistically Sound Data Mining

Following on from the successful ECML/PKDD workshop SSDM'14, we will
again bring together researchers in this significant and topical
field.

The field of statistics has developed sophisticated, well-founded
methods for inference from data. While some of these methods have
computational or practical limitations that make them infeasible to
apply directly to many data mining problems, the field of data mining
has much to gain from a more sophisticated understanding of the
strengths and limitations of these techniques, and from greater use of
them where they are appropriate.

In answer to this dilemma, a clear trend towards statistically sound
data mining is emerging. The main impetus for this trend comes from a
third party: the application fields. In a computerized world it is
easy to collect large data sets, but their analysis is more
difficult. Knowing the traditional statistical tests is no longer
sufficient for scientists, because one should first find the most
promising hidden patterns and models to be tested. This means that
there is an urgent need for efficient data mining algorithms that find
the desired patterns without missing significant discoveries or
producing too many spurious ones. A related problem is to find a
statistically justified compromise between underfitted patterns (too
generic to catch all important aspects) and overfitted ones (too
specific, holding merely by chance). However, before any algorithms
can be designed, one should first solve many fundamental problems,
such as how to define the statistical significance of the desired
patterns, how to evaluate overfitting, and how to interpret p-values
when multiple patterns are tested. In addition, the existing data
mining methods, alternative algorithms and goodness measures should be
evaluated to see which of them produce statistically valid results.
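As a toy illustration of the multiple testing problem mentioned above,
the following sketch (our own illustration, not a method endorsed by
the workshop) applies Holm's step-down correction to a list of pattern
p-values; all numbers are made up:

```python
def holm_adjust(pvals, alpha=0.05):
    """Holm's step-down correction: scan the p-values in increasing
    order and reject the i-th smallest while it is <= alpha / (m - i);
    stop at the first failure. Controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

With four tested patterns, `holm_adjust([0.001, 0.2, 0.01, 0.04])`
rejects only the first and third null hypotheses: the raw p-value 0.04
would pass an uncorrected 0.05 test, but not the corrected one.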

There are thus many important problems that should be worked on
jointly by people from data mining, machine learning and statistics,
as well as from the application fields. The goal of this workshop is
to offer a meeting point for this discussion. We want to bring together
people from different backgrounds and schools of science, both
theoretically and practically oriented, to specify problems, share
solutions and brainstorm new ideas.

To encourage genuine workshopping of actual problems, the workshop is
arranged in a novel way: in addition to traditional presentations, it
features an invited lecture and group work sessions. This means that
non-author participants can also contribute to the workshop results.
If you have **relevant problems** that you would like to be worked
on together at the workshop, please send them in before the workshop.

Topics of interest include but are not limited to:

- Useful and relevant theoretical results
- Search methods for statistically valid patterns and models
- Statistical validation of discovered patterns
- Evaluating statistical significance of clustering
- Statistical techniques for avoiding overfitted patterns
- Scaling statistical techniques to high-dimensional and large
data, covering both theoretical problems (such as the multiple testing
problem) and computational problems (calculating the required test
measures efficiently)
- Interesting applications with real world data demonstrating statistically
sound data mining
- Empirical comparisons between different statistical
validation methods and possibly other goodness measures
- Insightful position papers

We particularly encourage submissions which compare different schools
of statistics, like frequentist (Neyman-Pearsonian or Fisherian)
vs. Bayesian, or analytic vs. empirical significance testing. Equally
interesting are submissions introducing generic school-independent
computational methods. You can also submit papers describing work in progress.

#### Workshop Chairs

#### Programme Committee

Peter Flach, University of Bristol, UK

Wilhelmiina Hämäläinen, Aalto University, Finland

Florian Lemmerich, University of Würzburg, Germany

Cecile Low-Kam, Montreal Heart Institute, Canada

Siegfried Nijssen, Leiden University, Netherlands

Francois Petitjean, Monash University, Australia

Chedy Raissi, INRIA, France

Jan Ramon, INRIA, France

Jun Sese, AIST, CBRC, Japan

Koji Tsuda, University of Tokyo/AIST, Japan

Geoff Webb, Monash University, Australia

**Paper submission deadline: Sun July 17, 2016 (extended, was Mon July 4)**
Paper acceptance notification: Sun August 14, 2016 (was Mon July 25, 2016)

**Paper camera-ready deadline: Fri, September 2** (was Mon August 8, 2016)

Problem submission: Mon September 12, 2016 (preferably earlier)

Workshop date: Monday, September 19, 2016

The papers can be either regular papers (recommended maximum length 12
pages in the LNCS format) or short papers (6 pages). These page limits
are somewhat flexible.

All papers will be peer-reviewed by 2-3 reviewers. The accepted
papers will be presented at the workshop and included in the workshop
proceedings. The proceedings will be published in the JMLR: Workshop and Conference Proceedings series after the conference.

Submit your paper as a PDF via the EasyChair SSDM'16 submission page.

If you have good problem ideas for the group work, you can send them directly to Wilhelmiina Hämäläinen by email.

The workshop is organized in an untraditional way, with four invited talks and a discussion of open problems.

A **preliminary** schedule (Monday, September 19)

09:00 - 10:40 session I (100 min)

* 9:00 Opening

* 9:15-10:15 **Koji Tsuda**: Significant Pattern Mining: Efficient Algorithms and Biomedical Applications

* 10:15 **Matthijs van Leeuwen**: Expect the unexpected - On the significance of subgroups

10:40 - 11:00 coffee break (20 min)

11:00 - 12:40 session II (100 min)

* 11:00 **Francois Petitjean**: Scaling log-linear analysis to datasets with thousands of variables

* 11:40 **Jan Ramon**: Statistically sound analysis of populations resulting from haplotype evolution

* 12:10 Closing discussion with open problems

12:40 Lunch break

**Koji Tsuda**: Significant Pattern Mining: Efficient Algorithms
and Biomedical Applications.

Pattern mining techniques such as itemset mining, sequence mining and
graph mining have been applied to a wide range of datasets. To
convince biomedical researchers, however, it is necessary to show the
statistical significance of the obtained patterns, to prove that they
are not likely to emerge from random data. The key concept of
significance testing is the family-wise error rate (FWER), i.e., the
probability that at least one pattern is falsely discovered under the
null hypotheses. In the worst case, FWER grows linearly with the
number of all possible patterns. We show that, in reality, FWER grows
much more slowly than in the worst case, and it is possible to find
significant patterns in biomedical data. The following two properties
are exploited to accurately bound the FWER and compute small p-value
correction factors: 1) only closed patterns need to be counted; 2)
patterns of low support can be ignored, where the support threshold
depends on the Tarone bound. We introduce efficient depth-first search
algorithms for discovering all significant patterns and discuss
parallel implementations.
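For illustration only (not the speaker's implementation), a minimal
sketch of the Tarone idea for one-sided Fisher tests: a pattern of
support `x` has a smallest attainable p-value, so the correction
factor only needs to count patterns that could ever reach the adjusted
threshold. The sample sizes and the `supports` list below are
hypothetical:

```python
from math import comb

def min_p_value(x, n, n1):
    """Smallest attainable one-sided Fisher p-value for a pattern with
    support x in n samples, n1 of them positive (assumes x <= n1):
    reached when all x supporting samples fall in the positive class."""
    return comb(n1, x) / comb(n, x)

def tarone_factor(supports, n, n1, alpha=0.05):
    """Smallest correction factor K such that the number m of
    'testable' patterns (those whose minimum attainable p-value is
    <= alpha / K) does not exceed K; untestable patterns cannot
    contribute to the family-wise error rate."""
    K = 1
    while True:
        m = sum(1 for x in supports if min_p_value(x, n, n1) <= alpha / K)
        if m <= K:
            return K
        K += 1
```

With n = 20 samples (n1 = 10 positive) and candidate supports 1..8,
the factor is 4 rather than the Bonferroni factor 8, so each pattern
is tested at level 0.05/4 instead of 0.05/8.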

Presentation: http://www.cs.hut.fi/~hamalaw2/kojitsuda_lamp.pdf

**Matthijs van Leeuwen**: Expect the unexpected -
On the significance of subgroups.

Within the field of exploratory data mining, subgroup discovery is
concerned with finding regions in the data that stand out with respect
to a particular target. An important question is how to validate the
patterns found; how do we distinguish a true finding from a false
discovery? A common solution is to apply a statistical significance
test that states that a pattern is real iff it is different from a
random subset. In this paper we argue and empirically show that this
assumption is often too weak, as almost any realistic pattern language
specifies a set of subsets that strongly deviates from random
subsets. In particular, our analysis shows that one should expect the
unexpected in subgroup discovery: given a dataset and corresponding
description language, it is very likely that high-quality subgroups
can —and hence will— be found.
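The effect can be seen in a few lines of (our own, purely
illustrative) Python: the best of many random subsets of a structure-free
target already looks like a "high-quality subgroup". All sizes and
counts are arbitrary:

```python
import random

random.seed(0)
n, size, candidates = 200, 20, 5000

# a binary target with no structure at all
target = [random.random() < 0.5 for _ in range(n)]
base = sum(target) / n

def quality(idx):
    # deviation of the subgroup's target share from the overall share
    return abs(sum(target[i] for i in idx) / len(idx) - base)

# one random subset is usually unremarkable ...
one = quality(random.sample(range(n), size))

# ... but the best of many random subsets, mimicking a search over a
# large description language, deviates strongly from the baseline
best = max(quality(random.sample(range(n), size)) for _ in range(candidates))
```

Here `best` will typically be several times larger than the deviation
of a single random subset, which is exactly the multiple-comparisons
effect the talk addresses.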

Paper (to appear at Discovery Science 2016): http://patternsthatmatter.org/pubs/2016/expect_the_unexpected_significance_of_subgroups-vanleeuwen,ukkonen.pdf

**Francois Petitjean**: Scaling log-linear analysis to datasets with
thousands of variables.

Association discovery is a fundamental data mining task. The primary
statistical approach to association discovery between variables is
log-linear analysis. Classical approaches to log-linear analysis do
not scale beyond a dozen variables. I will explain how, drawing on
research in statistics, machine learning, graph theory and data
mining, we have developed methods to scale log-linear analysis to
datasets with thousands of variables on a standard desktop
computer. Our solution, 'Chordalysis', combines chordal graphs,
junction trees and advanced data structures borrowed from frequent
pattern mining. It makes it possible to model datasets with thousands
of variables in seconds without sacrificing the statistical soundness
of the process.
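A building block of classical log-linear analysis is a
likelihood-ratio test of independence; below is a minimal sketch for a
two-way contingency table (our illustration, not the Chordalysis
code):

```python
from math import log

def g2_independence(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way
    contingency table; compare to a chi-squared distribution with
    (rows - 1) * (cols - 1) degrees of freedom."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                # expected count under the independence model
                expected = row_sums[i] * col_sums[j] / n
                g2 += 2 * observed * log(observed / expected)
    return g2
```

For a 2x2 table (one degree of freedom) the 5% critical value is about
3.84; the contribution of Chordalysis is making such tests feasible
over thousands of variables, not the test itself.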

**Jan Ramon**: Statistically sound analysis of populations resulting from
haplotype evolution

Material coming.