# Statistically Sound Data Mining

Following on from the successful ECML/PKDD workshop SSDM'14, we will
again bring together researchers in this significant and topical
field.

The field of statistics has developed sophisticated, well-founded
methods for inference from data. While some of these methods have
computational or practical limitations that make them infeasible to
apply directly to many data mining problems, the field of data mining
has much to gain from a more sophisticated understanding of the
strengths and limitations of these techniques, and from greater use of
them where they are appropriate.

In answer to this dilemma, a clear trend towards statistically sound
data mining is emerging. The main impetus for this trend comes from a
third party: the application fields. In a computerized world it is
easy to collect large data sets, but their analysis is more
difficult. Knowing the traditional statistical tests is no longer
sufficient for scientists, because one should first find the most
promising hidden patterns and models to be tested. This means that
there is an urgent need for efficient data mining algorithms that find
the desired patterns without missing significant discoveries or
producing too many spurious ones. A related problem is to find a
statistically justified compromise between underfitted patterns (too
generic to catch all important aspects) and overfitted ones (too
specific, holding merely by chance). However, before any algorithms
can be designed, one should first solve many fundamental problems,
such as how to define the statistical significance of the desired
patterns, how to evaluate overfitting, and how to interpret p-values
when multiple patterns are tested. In addition, the existing data
mining methods, alternative algorithms and goodness measures should be
evaluated to see which of them produce statistically valid results.
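As a toy illustration of the multiple testing problem mentioned above,
the following sketch (our own illustration, not a method endorsed by
the workshop) applies Holm's step-down correction to a list of pattern
p-values; all numbers are made up:

```python
def holm_adjust(pvals, alpha=0.05):
    """Holm's step-down correction: scan the p-values in increasing
    order and reject the i-th smallest while it is <= alpha / (m - i);
    stop at the first failure. Controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

With four tested patterns, `holm_adjust([0.001, 0.2, 0.01, 0.04])`
rejects only the first and third null hypotheses: the raw p-value 0.04
would pass an uncorrected 0.05 test, but not the corrected one.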

There are thus many important problems that should be worked on
jointly by people from data mining, machine learning and statistics,
as well as from the application fields. The goal of this workshop is
to offer a meeting point for this discussion. We want to bring together
people from different backgrounds and schools of science, both
theoretically and practically oriented, to specify problems, share
solutions and brainstorm new ideas.

To encourage genuine workshopping of actual problems, the workshop is
arranged in a novel way: in addition to traditional presentations, it
features an invited lecture and group work sessions. This means that
non-author participants can also contribute to the workshop results.
If you have **relevant problems** that you would like to be worked
on together at the workshop, please send them in before the workshop.

Topics of interest include but are not limited to:

- Useful and relevant theoretical results
- Search methods for statistically valid patterns and models
- Statistical validation of discovered patterns
- Evaluating statistical significance of clustering
- Statistical techniques for avoiding overfitted patterns
- Scaling statistical techniques to high-dimensional and large
data, covering both theoretical problems (such as the multiple testing
problem) and computational problems (calculating the required test
measures efficiently)
- Interesting applications with real world data demonstrating statistically
sound data mining
- Empirical comparisons between different statistical
validation methods and possibly other goodness measures
- Insightful position papers

We particularly encourage submissions which compare different schools
of statistics, like frequentist (Neyman-Pearsonian or Fisherian)
vs. Bayesian, or analytic vs. empirical significance testing. Equally
interesting are submissions introducing generic school-independent
computational methods. You can also submit papers describing work in progress.

#### Workshop Chairs

#### Programme Committee

Peter Flach, University of Bristol, UK

Wilhelmiina Hämäläinen, Aalto University, Finland

Florian Lemmerich, University of Würzburg, Germany

Cecile Low-Kam, Montreal Heart Institute, Canada

Siegfried Nijssen, Leiden University, Netherlands

Francois Petitjean, Monash University, Australia

Chedy Raissi, INRIA, France

Jan Ramon, INRIA, France

Jun Sese, AIST, CBRC, Japan

Koji Tsuda, University of Tokyo/AIST, Japan

Geoff Webb, Monash University, Australia

**Paper submission deadline: Sun July 17, 2016 (extended, was Mon July 4)**
Paper acceptance notification: Sun August 14, 2016 (was Mon July 25, 2016)

**Paper camera-ready deadline: Fri, September 2** (was Mon August 8, 2016)

Problem submission: Mon September 12, 2016 (preferably earlier)

Workshop date: Monday, September 19, 2016

The papers can be either regular papers (recommended maximum length 12
pages in the LNCS format) or short papers (6 pages). These page limits
are somewhat flexible.

All papers will be peer-reviewed by 2-3 reviewers. The accepted
papers will be presented at the workshop and included in the workshop
proceedings. The proceedings will be published in the JMLR: Workshop and Conference Proceedings series after the conference.

Submit your paper as a PDF via the EasyChair SSDM'16 submission page.

If you have good problem ideas for the group work, you can send them directly to Wilhelmiina Hämäläinen by email.

The workshop is organized in an untraditional way, with four invited talks and a discussion of open problems.

A **preliminary** schedule (Monday, September 19)

09:00 - 10:40 session I (100 min)

* 9:00 Opening

* 9:15-10:15 **Koji Tsuda**: Significant Pattern Mining: Efficient Algorithms and Biomedical Applications

* 10:15 **Matthijs van Leeuwen**: Expect the unexpected - On the significance of subgroups

10:40 - 11:00 coffee break (20 min)

11:00 - 12:40 session II (100 min)

* 11:00 **Francois Petitjean**: Scaling log-linear analysis to datasets with thousands of variables

* 11:40 **Jan Ramon**: Statistically sound analysis of populations resulting from haplotype evolution

* 12:10 Closing discussion with open problems

12:40 Lunch break

**Koji Tsuda**: Significant Pattern Mining: Efficient Algorithms
and Biomedical Applications.

Pattern mining techniques such as itemset mining, sequence mining and
graph mining have been applied to a wide range of datasets. To
convince biomedical researchers, however, it is necessary to show the
statistical significance of the obtained patterns, to prove that they
are not likely to emerge from random data. The key concept of
significance testing is the family-wise error rate (FWER), i.e., the
probability that at least one pattern is falsely discovered under the
null hypotheses. In the worst case, FWER grows linearly with the
number of all possible patterns. We show that, in reality, FWER grows
much more slowly than in the worst case, and it is possible to find
significant patterns in biomedical data. The following two properties
are exploited to accurately bound the FWER and compute small p-value
correction factors: 1) only closed patterns need to be counted; 2)
patterns of low support can be ignored, where the support threshold
depends on the Tarone bound. We introduce efficient depth-first search
algorithms for discovering all significant patterns and discuss
parallel implementations.
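For illustration only (not the speaker's implementation), a minimal
sketch of the Tarone idea for one-sided Fisher tests: a pattern of
support `x` has a smallest attainable p-value, so the correction
factor only needs to count patterns that could ever reach the adjusted
threshold. The sample sizes and the `supports` list below are
hypothetical:

```python
from math import comb

def min_p_value(x, n, n1):
    """Smallest attainable one-sided Fisher p-value for a pattern with
    support x in n samples, n1 of them positive (assumes x <= n1):
    reached when all x supporting samples fall in the positive class."""
    return comb(n1, x) / comb(n, x)

def tarone_factor(supports, n, n1, alpha=0.05):
    """Smallest correction factor K such that the number m of
    'testable' patterns (those whose minimum attainable p-value is
    <= alpha / K) does not exceed K; untestable patterns cannot
    contribute to the family-wise error rate."""
    K = 1
    while True:
        m = sum(1 for x in supports if min_p_value(x, n, n1) <= alpha / K)
        if m <= K:
            return K
        K += 1
```

With n = 20 samples (n1 = 10 positive) and candidate supports 1..8,
the factor is 4 rather than the Bonferroni factor 8, so each pattern
is tested at level 0.05/4 instead of 0.05/8.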

Presentation: http://www.cs.hut.fi/~hamalaw2/kojitsuda_lamp.pdf

**Matthijs van Leeuwen**: Expect the unexpected -
On the significance of subgroups.

Within the field of exploratory data mining, subgroup discovery is
concerned with finding regions in the data that stand out with respect
to a particular target. An important question is how to validate the
patterns found; how do we distinguish a true finding from a false
discovery? A common solution is to apply a statistical significance
test that states that a pattern is real iff it is different from a
random subset. In this paper we argue and empirically show that this
assumption is often too weak, as almost any realistic pattern language
specifies a set of subsets that strongly deviates from random
subsets. In particular, our analysis shows that one should expect the
unexpected in subgroup discovery: given a dataset and corresponding
description language, it is very likely that high-quality subgroups
can —and hence will— be found.
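The effect can be seen in a few lines of (our own, purely
illustrative) Python: the best of many random subsets of a structure-free
target already looks like a "high-quality subgroup". All sizes and
counts are arbitrary:

```python
import random

random.seed(0)
n, size, candidates = 200, 20, 5000

# a binary target with no structure at all
target = [random.random() < 0.5 for _ in range(n)]
base = sum(target) / n

def quality(idx):
    # deviation of the subgroup's target share from the overall share
    return abs(sum(target[i] for i in idx) / len(idx) - base)

# one random subset is usually unremarkable ...
one = quality(random.sample(range(n), size))

# ... but the best of many random subsets, mimicking a search over a
# large description language, deviates strongly from the baseline
best = max(quality(random.sample(range(n), size)) for _ in range(candidates))
```

Here `best` will typically be several times larger than the deviation
of a single random subset, which is exactly the multiple-comparisons
effect the talk addresses.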

Paper (to appear at Discovery Science 2016): http://patternsthatmatter.org/pubs/2016/expect_the_unexpected_significance_of_subgroups-vanleeuwen,ukkonen.pdf

**Francois Petitjean**: Scaling log-linear analysis to datasets with
thousands of variables.

Association discovery is a fundamental data mining task. The primary
statistical approach to association discovery between variables is
log-linear analysis. Classical approaches to log-linear analysis do
not scale beyond a dozen variables. I will explain how, drawing on
research in statistics, machine learning, graph theory and data
mining, we have developed methods to scale log-linear analysis to
datasets with thousands of variables on a standard desktop
computer. Our solution, 'Chordalysis', combines chordal graphs,
junction trees and advanced data structures borrowed from frequent
pattern mining. It makes it possible to model datasets with thousands
of variables in seconds without sacrificing the statistical soundness
of the process.
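A building block of classical log-linear analysis is a
likelihood-ratio test of independence; below is a minimal sketch for a
two-way contingency table (our illustration, not the Chordalysis
code):

```python
from math import log

def g2_independence(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way
    contingency table; compare to a chi-squared distribution with
    (rows - 1) * (cols - 1) degrees of freedom."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                # expected count under the independence model
                expected = row_sums[i] * col_sums[j] / n
                g2 += 2 * observed * log(observed / expected)
    return g2
```

For a 2x2 table (one degree of freedom) the 5% critical value is about
3.84; the contribution of Chordalysis is making such tests feasible
over thousands of variables, not the test itself.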

**Jan Ramon**: Statistically sound analysis of populations resulting from
haplotype evolution

Material coming.