Summer School on Massive Data Mining

Dates: August 8-10, 2012.
Place: Aud. 2, main bldg., IT University of Copenhagen (ITU), Denmark. (Directions by metro.)
Audience: The summer school is aimed at PhD students and young researchers both from the algorithms community and the data mining community. A typical participant will come from a group that aims at publishing in algorithms conferences such as ESA and SODA, and/or in data mining conferences such as ICDM and KDD.
Credit: The ITU PhD school will issue a diploma of 4 ECTS to participants who have actively participated in the school, including presenting a poster.
Registration: The registration system is closed, but on-site registration (€120 with credit card) is possible. Covers lunches and coffee breaks.
Organizers: Rasmus Pagh (chair), Annelie Jepsen (admin), Konstantin Kutzkov, Ninh Pham, Morten Stöckel

Topics and speakers

Matrix and Graph Algorithms: Approximation Algorithms, Implicit Statistical Regularization, and Very Large-scale Applications. (slides1, slides2, slides3, slides4)
Michael Mahoney, Stanford University.

Toon Calders, Eindhoven University of Technology

Suresh Venkatasubramanian, University of Utah

Schedule (pdf)

August 8:
8.00- 9.00 Registration
9.00-12.00 Welcome.
12.00-13.00 Lunch break
13.00-17.00 A decade of mining patterns from large datasets (T. Calders)
            Algorithms for mining large graphs (A. Gionis)
17.30-21.00 Social event

August 9:
9.00-12.00 Algorithms for mining large graphs (A. Gionis)
           Matrix and Graph Algorithms (M. Mahoney)(slides1slides2slides3slides4)
12.00-13.00 Lunch break
13.00-14.30 Panel discussion
15.00-17.00 Poster session

August 10:
9.00-12.00 Matrix and Graph Algorithms (M. Mahoney)
           Clustering and Metaclustering (S. Venkatasubramanian)
12.00-13.00 Lunch break
13.00-16.00 Clustering and Metaclustering (S. Venkatasubramanian)

Eager to learn even more? Consider attending Advanced Topics in Machine Learning, August 13-17 at DTU, Copenhagen, Domain Adaptation in Image Analysis at DIKU, Copenhagen and/or Algorithms for Modern Parallel and Distributed Models August 20-23 at MADALGO, Aarhus.

Summer school participants

Description of contents
We encourage participants to read articles marked with an asterix (*) before the summer school.

A decade of mining patterns from large datasets: advances and open problems
- Pattern mining in general
o connection with listing hypergraphs transversals
o some complexity results
o extensions to sequences/graphs/...
- Pattern explosion problem and solutions throughout last decade
o condensed representations (closed sets, NDIs)
o approaches to assess significance/surprisingness of a pattern
e.g., swap-randomization types of work; p-value based methods;
MDL-based methods; patterns as a "summary" of the data 

Algorithms for mining large graphs
- algorithms for counting triangles, clustering coefficient, minors, computing shortest-path distances, distribution of distances,...
Approximate computation of graph statistics: Counting triangles in data streams *, Efficient algorithms for large-scale local triangle countingTriangle sparsifiers
Shortest-path queries: Shortest-Path Queries in Static Networks *
Graph-mining algorithms in MapReduce: Filtering: a method for solving graph problems in MapReduceDensest Subgraph in Streaming and MapReduceSocial content matching in MapReduce

Matrix and Graph Algorithms:
Approximation Algorithms, Implicit Statistical Regularization, and Very Large-scale Applications
- Introduction: Algorithmic and Statistical Perspectives
- Randomized Matrix Algorithms: and Large-scale Applications
- Social and Information Networks: Algorithms and Structure
- Approximate Computation and Implicit Regularization

Clustering and Metaclustering: When a single answer isn't enough
Clustering is one of the most basic tools in data mining. Given a collection of items and some idea of similarity (or distance) between pairs of items, the goal of clustering is to group them so that similar objects are in a group, and objects not in groups are distant from each other.
Over the years, a vast body of work has grown up around the problem of clustering. In most cases, the variations have come from different ways to define "distance", "similarity", and "groups". Once these are defined, a cost function as chosen, and the clustering is the grouping that optimizes this cost.
But what if the optimal answer isn't the right one ? What if NO single answer is the right one? In recent years, we've come to realize that trusting any single clustering algorithm to produce meaningful answers is dangerous, and that the path to meaningful answers requires many different "views" of the data.
This is the area of 'metaclustering' - can we combine information from different ways to cluster data to get at deeper structure in data, rather than merely relying on the output of single procedure?
I'll present an overview of clustering techniques, and outline some of the key problems of metaclustering, such as measuring the distance between clusterings, how to find consensus between clusterings, and how to explore the space of clusterings to find new and interesting answers.

Practical information:
- Hotel: The IT University is reachable by the Metro from many hotels in downtown Copenhagen. Another possibility is to stay in Ørestad, between the airport and ITU. There is the DanHostel, as well as CabInn Metro. In either case, expect to use about 15 minutes to reach ITU.

Rasmus Pagh,
Aug 7, 2012, 11:14 PM
Rasmus Pagh,
Aug 8, 2012, 11:45 PM
Rasmus Pagh,
Aug 13, 2012, 6:11 AM
Rasmus Pagh,
Aug 6, 2012, 12:37 AM