Topics and speakers
Matrix and Graph Algorithms: Approximation Algorithms, Implicit Statistical Regularization, and Very Large-scale Applications. (slides1, slides2, slides3, slides4)
Michael Mahoney, Stanford University.
Toon Calders, Eindhoven University of Technology
Suresh Venkatasubramanian, University of Utah
8.00- 9.00 Registration
A decade of mining patterns from large datasets (T. Calders)
12.00-13.00 Lunch break
13.00-17.00 A decade of mining patterns from large datasets (T. Calders)
Algorithms for mining large graphs (A. Gionis)
17.30-21.00 Social event
9.00-12.00 Algorithms for mining large graphs (A. Gionis)
12.00-13.00 Lunch break
13.00-14.30 Panel discussion
15.00-17.00 Poster session
9.00-12.00 Matrix and Graph Algorithms (M. Mahoney)
Clustering and Metaclustering (S. Venkatasubramanian)
12.00-13.00 Lunch break
13.00-16.00 Clustering and Metaclustering (S. Venkatasubramanian)
Eager to learn even more? Consider attending Advanced Topics in Machine Learning, August 13-17 at DTU, Copenhagen, Domain Adaptation in Image Analysis at DIKU, Copenhagen and/or Algorithms for Modern Parallel and Distributed Models August 20-23 at MADALGO, Aarhus.
Description of contents
We encourage participants to read articles marked with an asterix (*) before the summer school.
A decade of mining patterns from large datasets: advances and open problems
- Pattern mining in general
o connection with listing hypergraphs transversals
o some complexity results
o extensions to sequences/graphs/...
- Pattern explosion problem and solutions throughout last decade
o condensed representations (closed sets, NDIs)
o approaches to assess significance/surprisingness of a pattern
e.g., swap-randomization types of work; p-value based methods;
MDL-based methods; patterns as a "summary" of the data
Algorithms for mining large graphs
- algorithms for counting triangles, clustering coefficient, minors, computing shortest-path distances, distribution of distances,...
Reading:Approximate computation of graph statistics: Counting triangles in data streams *, Efficient algorithms for large-scale local triangle counting, Triangle sparsifiers
Shortest-path distance distributions: ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs *, HyperANF: approximating the neighbourhood function of very large graphs on a budget, Four Degrees of Separation
Shortest-path queries: Shortest-Path Queries in Static Networks *
Graph-mining algorithms in MapReduce: Filtering: a method for solving graph problems in MapReduce, Densest Subgraph in Streaming and MapReduce, Social content matching in MapReduce
Matrix and Graph Algorithms:
Approximation Algorithms, Implicit Statistical Regularization, and Very Large-scale Applications
- Introduction: Algorithmic and Statistical Perspectives
- Randomized Matrix Algorithms: and Large-scale Applications
- Social and Information Networks: Algorithms and Structure
- Approximate Computation and Implicit Regularization
Clustering and Metaclustering: When a single answer isn't enough
Clustering is one of the most basic tools in data mining. Given a collection of items and some idea of similarity (or distance) between pairs of items, the goal of clustering is to group them so that similar objects are in a group, and objects not in groups are distant from each other.
Over the years, a vast body of work has grown up around the problem of clustering. In most cases, the variations have come from different ways to define "distance", "similarity", and "groups". Once these are defined, a cost function as chosen, and the clustering is the grouping that optimizes this cost.
But what if the optimal answer isn't the right one ? What if NO single answer is the right one? In recent years, we've come to realize that trusting any single clustering algorithm to produce meaningful answers is dangerous, and that the path to meaningful answers requires many different "views" of the data.
This is the area of 'metaclustering' - can we combine information from different ways to cluster data to get at deeper structure in data, rather than merely relying on the output of single procedure?
I'll present an overview of clustering techniques, and outline some of the key problems of metaclustering, such as measuring the distance between clusterings, how to find consensus between clusterings, and how to explore the space of clusterings to find new and interesting answers.
- Meta Clustering *
- Hotel: The IT University is reachable by the Metro from many hotels in downtown Copenhagen. Another possibility is to stay in Ørestad, between the airport and ITU. There is the DanHostel, as well as CabInn Metro. In either case, expect to use about 15 minutes to reach ITU.