Program

Date - May 1st, 2010

Renaissance Columbus Downtown Hotel
Columbus, Ohio, USA

Invited talks:

  • "Pig: A high level language for doing large scale data processing using Hadoop", Benjamin Reed (PIG Team, Yahoo! Research)
  • "Datawarehousing and Analytics Infrastructure at Facebook",  Ashish Thusoo (HIVE Team, Facebook)
  • "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics", Amol Ghoting (IBM Research)
  • "Large-scale Machine Learning using DryadLINQ", Mihai Budiu (DRYAD/DRYAD-LINQ Team, Microsoft Research)
  • "Mining under Flexible Resource Constraints  - robust and versatile methods that scale gracefully", Wei Fan (IBM Research)
  • "Peta Scale Graph Mining with Pegasus", U Kang (Carnegie Mellon University)
Paper presentations:
  • A Data Intensive Multi-chunk Ensemble Technique to Classify Stream Data Using Map-Reduce Framework
  • Mining Frequent Highly-Correlated Item-Pairs at Very Low Support Levels
  • GPUML: Graphical processors for speeding up kernel machines
  • Database Support for Bregman Co-clustering

8:50  - 9 am
 Opening remarks
9 - 9:30 am
Invited presentation - (SLIDES)

Large-scale Machine Learning using DryadLINQ

Mihai Budiu, Researcher, Microsoft Research, Silicon Valley
 
We describe DryadLINQ, a general-purpose system for large scale data-parallel computing, and illustrate its use on a number of machine learning problems. DryadLINQ was designed to make it easier for non-specialists to write general purpose, scalable programs that can operate on very large input datasets. In order to appeal to non-specialists we designed the programming interface to use a high level of abstraction that insulates the programmer from most of the detail and complexity of parallel and distributed execution. In order to support general-purpose computing we embedded these high-level abstractions in .NET, giving developers access to full-featured programming languages with rich type systems and proven mechanisms, such as classes and libraries, for managing complex, long-lived and geographically distributed software projects. In order to support scalability over very large data and compute clusters we arranged that the DryadLINQ compiler generate code for the Dryad runtime, a well-tested and highly efficient distributed execution engine.
 
As machine learning moves into the mainstream and operates over diverse data types including documents, images and graphs, it is increasingly appealing to move away from domain-specific languages that were primarily designed for linear algebra, and towards general-purpose languages that support rich types and standardized libraries. We  demonstrate that a general-purpose language such as C# supports effective, concise implementations of standard machine learning algorithms, and that DryadLINQ efficiently scales these implementations to operate over hundreds of computers and very large datasets.
 
DryadLINQ and Dryad are available for download using an academic license.

Speaker bio:
 
 
Mihai Budiu has received a Ph.D. in computer science from Carnegie Mellon University, working on computer architecture and reconfigurable computing. Since 2004 he has been a member of Microsoft Research in Silicon Valley, where he has worked on computer security, performance analysis and large scale cluster computing. He is currently working on building high-level abstractions for large-scale computation, including programming languages, runtimes, data structures, libraries and algorithms.


 9:30 - 10 am
Paper presentation (PAPER) (SLIDES)

A Data Intensive Multi-chunk Ensemble Technique to Classify Stream Data Using Map-Reduce Framework

Tahseen Al-Khateeb, Mohammad Salim Ahmed, Mohammad Masud and Latifur Khan
University of Texas at Dallas

We propose a data intensive and distributed multi-chunk ensemble classifier based data mining technique to classify data streams. In our approach, we combine r most recent consecutive data chunks with data chunks in the current ensemble and generate a new ensemble using this data for training. By introducing this multi-chunk ensemble technique in a Map-Reduce framework and considering the concept-drift of the data, we significantly reduce the running time and classification error compared to different ensemble approaches. We have empirically proved its effectiveness over other state-of-the-art stream classification techniques on synthetic data and real world botnet traffic.

 10 - 10:30 am
 Coffee break
 10:30 - 11 am
 Invited presentation - (SLIDES)

Datawarehousing and Analytics Infrastructure at Facebook

Ashish Thusoo, Facebook
 
Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics. These products range from simple reporting applications like Insights for the Facebook Ad Network, to more advanced kinds such as Facebook's Lexicon product. As a result a flexible infrastructure that caters to the needs of these diverse applications and users and that also scales up in a cost effective manner with the ever increasing amounts of data being generated on Facebook, is critical. Hive and Hadoop are the technologies that we have used to address these requirements at Facebook. In this talk we will present the general architecture of log collection and datawarehousing at facebook and also introduce core building blocks that we have used to implement a peta byte scale datawarehousing solution.
 
Speaker bio:

Ashish Thusoo has been with Facebook for the last couple of years and is managing the Facebook data infrastructure team in his most recent role. He started the Hive project at Facebook along with Joydeep and serves at the project lead for Hive at Apache. He is also part of the Hadoop PMC at Apache and has presented Hive at a number of conferences, forums and panels. Ashish has deep expertise in data processing and parallel processing technologies, infrastructure and applications built on those infrastructures. In the past he has worked at Oracle in areas of Parallel Query Execution as well as XML Databases. At Oracle he built many core data warehousing and query processing features and was recognized as one of the leaders in the Parallel Execution team. These features are regularly used in most Oracle based data warehouses.

 11 - 11:30 am   
 Paper presentation - (PAPER) (SLIDES)

Mining Frequent Highly-Correlated Item-Pairs at Very Low Support Levels

Ian Sandler and Alex Thomo
University of Victoria, British Columbia, Canada

The ability to extract frequent pairs from a set of transactions is one of the fundamental building blocks of data mining. When the number of items in a given transaction is relatively small the problem is trivial. Even when dealing with millions of transactions it is still trivial if the number of unique items in the transaction set is small. The problem becomes much more challenging when we deal with millions of transactions, each containing hundreds of items that
are part of a set of millions of potential items. Especially when we are looking for highly correlated results at extremely low support levels. For 25 years the Direct Hashing and Pruning (Park Chen Yu) (PCY) algorithm has been the principal
technique used when there are billions of potential pairs that need to be counted. In this paper we propose a new approach that allows us to take full advantage of both multi-core and multi-CPU availability which works in cases where PCY fails, with excellent performance scaling that continues even when the number of processors, unique items and items per transaction are at their highest. We believe that our approach has much broader applicability in the field of co-occurrence counting, and can be used to generate much more interesting results when mining very large data sets.

 11:30 - 12 pm
Invited presentation - (SLIDES)

Mining under Flexible Resource Constraints  - robust and versatile methods that scale gracefully

Wei Fan, IBM T. J. Watson Research Center

    One important design goal for high performance analytics is the ability to "scale gracefully" from limited resources (one single computer) to lots of resources (HPC, clouds, stream computing platform). The "gracefulness" includes fast model construction under limited resources, formally guaranteed and improved performance with more resources, "any time" availability during model construction, flexibility to solve a wide range of problems (classification, clustering, regression, multi-class, multi-label) problems. The suits of solutions is required not only to handle traditional data in feature vector format that we are familiar of, but also can extract features from raw data not in feature vector format (transactional data, graph data, sequence data). In this talk, we will discuss two methods that appear simple and straightforward, but have all these properties mentioned above. The various applications include chip manufacturing, drug design, DNA sequence, intrusion detection signature, etc.


Speaker bio:

    Dr. Wei Fan received his PhD in Computer Science from Columbia University in 2001 and has been working in IBM T.J.Watson Research since 2000. He published more than 70 papers in top data mining, machine learning and database conferences, such as KDD, SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE, AAAI, ICML etc. Dr. Fan has served as Area Chair, Senior PC of SIGKDD'06/10, SDM'08/10 and ICDM'08/09, sponsorship co-chair of SDM'09, award commitee member of ICDM'09, workshop co-chair of ICDM'10, as well as PC of several prestigious conferences in the area including KDD'09/8/07/05, ICDM'07/06/05/04/03, SDM'09/07/06/05/04, CIKM'09/08/07/06, ECML/PKDD'07'06, ICDE'04, AAAI'07, PAKDD'09/08/07, EDBT'04, WWW'09/08/07, etc. He is on the advisory board of KD2U. Dr. Fan was invited to speak at ICMLA'06. He served as US NSF panelist in 2007/08. His main research interests and experiences are in various areas of data mining and database systems, such as, risk analysis, high performance computing, extremely skewed distribution, cost-sensitive learning, data streams, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, novel applications and commercial data mining systems. He is particularly interested in simple, unconventional, but effective methods to solve difficult problems. His thesis work on intrusion detection has been licensed by a start-up company since 2001. His co-teamed submission that uses Random Decision Tree has won the ICDM'08 Contest Crown Awards. His co-authored paper in ICDM'06 that uses "Randomized Decision Tree" to predict skewed ozone days won the best application paper award. His co-authored paper in KDD'97 on distributed learning system "JAM" won the runner-up best application paper award.

 12 - 1:30 pm
 Lunch break
 1:30 - 2pm
 Invited presentation - (SLIDES)

Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics

Amol Ghoting, IBM Research

Hadoop is an open-source implementation of Google's Map-Reduce programming model. Over the past few years, it has evolved into a popular platform for parallelization in industry and academia. Furthermore, trends suggest that Hadoop will likely be the analytics platform of choice on forthcoming Cloud-based systems. Unfortunately, implementing parallel machine learning/data mining (ML/DM) algorithms on Hadoop is complex and time consuming. To address this challenge, we present Hadoop-ML, an infrastructure to facilitate the implementation of parallel ML/DM algorithms on Hadoop. Hadoop-ML has been designed to allow for the specification of both task-parallel and data-parallel ML/DM algorithms. Furthermore, it supports the composition of parallel ML/DM algorithms using both serial as well as parallel building blocks -- this allows one to write reusable parallel code. The proposed abstraction eases the implementation process by requiring the user to only specify computations and their dependencies, without worrying about scheduling, data management, and communication. As a consequence, the codes are portable in that the user never needs to write Hadoop-specific code. This potentially allows one to leverage future parallelization platforms without rewriting one's code.

Speaker bio:

Amol Ghoting joined the Data Mining Systems group at IBM Research in October, 2007 shortly after receiving his Ph.D. in Computer Science and Engineering from The Ohio State University. His research interests lie in data mining, high performance computing, and architecture-conscious algorithms. He is especially interested in designing algorithms and systems support to process and analyze large data sets, while ensuring efficient utilization of modern and emerging computer architectures. His dissertation research showed how data and knowledge reuse can significantly improve the performance of data mining algorithms. He received his B.E. in Computer Engineering from Victoria Jubilee Technical Institute, University of Bombay in 2000 and his M.S. in Computer Science from the University of Southern California in 2001. He is the recipient of a Best Paper Award at the 2005 International Conference on Very Large Databases, an Outstanding Research Award by The Ohio State University (2005), an IBM Ph.D. Fellowship (2006), a Best of SIGMOD selection (2009), and was a finalist for the ACM Gordon Bell Award (2009).


 2 - 2:30 pm
Paper presentation - (PAPER) (SLIDES)

Database Support for Bregman Co-clustering

Kuo-Wei Hsu, Jaideep Srivastava
University of Minnesota, Minnesota, USA

As the amount of data generated and collected becomes larger and more complicated, scalable tools for
effective data mining become more important. Because data mining is a broad area of research, we focus on co-clustering, a new yet promising subarea of data mining. Co-clustering performs two-way clusterings and simultaneously clusters row and column entities. Specifically we consider a recently proposed algorithm, Bregman co-clustering algorithm, which has shown significant promise for clustering quality and hence gained popularity. However, a recent study has demonstrated that a main memory based implementation of the algorithm creates difficulty for applications with large data sets. Therefore, the focus of this paper is a scalable implementation of Bregman co-clustering algorithm. We discuss how summary statistics required by the algorithm can be stored in a data cube and computed by an OLAP engine. We conduct experiments using several real data sets from various domains, including matrix decomposition, bioinformatics, document clustering, and collaborative filtering (CF) based recommendation. Experimental results demonstrate the potential of our OLAP based implementation. Moreover, our implementation has a large prospective user base as it works on top of OLAP, a widely deployed technique; further, it facilitates the use of Bregman co-clustering algorithm for applications with large data sets while co-clustering is finding applications in various problems. The research is a step towards the increasing needs in connecting database and data mining.

 2:30 - 3 pm
  Invited presentation - (SLIDES)

Pig: A high level language for doing large scale data processing using Hadoop

Benjamin Reed, Yahoo! Research

Map/Reduce has become a common programming model for processing large data sets. In particular the open source implementation of Map/Reduce, called Hadoop, allows users to write parallel programs that run on large clusters of commodity machines. The simplicity of the Map/Reduce paradigm has enabled Hadoop to be efficient and scalable. 
However, when programmers write directly to the Map/Reduce API they find that operations such as filtering, projecting, joining, and splitting are tedious and repetitive to implement. The Map/Reduce API also does not expose much of the semantics of the operations being coded, so opportunities for optimizations are missed. In this talk I will review the Map/Reduce processing paradigm and explain how we built Pig on top of it to enable common operations to be expressed easily in a high level language, called Pig Latin. I will explain how Pig Latin programs are compiled into Map/Reduce programs and executed on the Hadoop framework. I will also talk about the advantages of using Pig over Map/Reduce directly. Pig is subproject of Apache Hadoop.

 3 - 3:30 pm
 Coffee break
 3:30 - 4 pm
Paper presentation - (PAPER) (SLIDES)

GPUML: Graphical processors for speeding up kernel machines

Balaji Vasan Srinivasan, Qi Hu, and Ramani Duraiswami
University of Maryland, College Park, MD, USA

Algorithms based on kernel methods play a central role in statistical machine learning. At their core are a number
of linear algebra operations on matrices of kernel functions which take as arguments the training and testing
data. These range from the simple matrix-vector product, to more complex matrix decompositions, and
iterative formulations of these. Often the algorithms scale quadratically or cubically, both in memory and operational
complexity, and as data sizes increase, kernel methods scale poorly. We use parallelized approaches on a multi-core graphical processor (GPU) to partially address this lack of scalability. GPUs are used to scale three different classes of problems, a simple kernel matrix-vector product, iterative solution of linear systems of kernel function and QR and Cholesky decomposition of kernel matrices. Application of these accelerated approaches in scaling several kernel based learning approaches are shown, and in each case substantial speedups are obtained. The core software is released as
an open source package, GPUML.

 4 - 4:30 pm
 Invited presentation - (SLIDES)

Peta Scale Graph Mining with Pegasus

U Kang, Carnegie Mellon University

We describe Pegasus, an open source Peta Graph Mining library which performs typical graph mining tasks such as finding the connected components, computing the diameter and radii, and computing the importance score of nodes. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, Pegasus is the first such library on the top of the  Hadoop platform. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. We describe a very important primitive for Pegasus, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving good scale-up on the number of machines and edges. We run experiments on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs with 6,7 billion edges. 

Speaker bio:

U Kang is currently a Ph.D. candidate in the Computer Science Department at Carnegie Mellon University. He has received the best application paper runner-up award in ICDM 2009. His main research interests lie in the fields of large scale graph mining.
 4:30 - 4:40 pm
 Closing remarks