Keynotes and Tutorials

(Keynote) Flexibility without Anarchy: Analytics Infrastructure at Twitter

Speaker: Jimmy Lin, Twitter

Abstract: The data analytics infrastructure at Twitter supports a myriad of technologies: Hadoop, Pig (with Python and JRuby), Cascading/Scalding, HBase, MySQL, Vertica, and ZooKeeper. Our philosophy is to let developers and data scientists use whatever tools they are most comfortable with, while allowing individual components to be woven together into complex analytic tapestries. Managing complex workflows that cross language boundaries (e.g., Java vs. Pig vs. Scala) as well as architectures with significant impedance mismatches (e.g., Hadoop vs. Vertica) has been and remains a significant challenge. In this talk, I'll detail some of these issues and our present solutions.

Short bio: Jimmy Lin is a visiting scientist at Twitter, currently on leave from the University of Maryland. His current research focuses on scalable algorithms for data analytics, particularly on text and graph data. At Twitter, he works on services designed to surface relevant content for users and the distributed infrastructure that supports mining relevance signals from massive amounts of data.

(Keynote) Data Processing Workflows @ Google

Speaker: Pawel Garbacki, Google

Abstract: Workflow management is the term that Google uses to describe how we control the execution of data processing workflows running on top of our computing infrastructure. Data processing workflows at Google build search indices, compute ads placement, identify copyrighted YouTube videos, construct geo maps, and perform a myriad of other batch tasks. Some of these tasks can be implemented on top of a self-contained architecture such as Pregel or FlumeJava, but we observe a growing class of tasks modeled as workflows that cross the boundary of a single architecture, e.g., a generic MapReduction feeding data to a Pregel computation that in turn creates output in a format optimized for Tenzing queries. In this talk I will discuss the properties of data processing workflows at Google and identify some of the challenges we are facing in supporting general-purpose workflows. Many of these challenges represent interesting research opportunities.

Short bio: Pawel Garbacki holds a PhD in distributed systems from Delft University of Technology, Delft, The Netherlands. In the summers of 2005 and 2006 he worked at the IBM T.J. Watson Research lab on a virtualized application execution platform. In 2008 he joined Google, where he works in the area of workflow management.

(Tutorial) Oozie: a workflow engine for Hadoop

Speaker: Mohammad Islam, Yahoo/Cloudera

Abstract: Hadoop is a massively scalable parallel computation platform capable of running hundreds of jobs concurrently, and many thousands of jobs per day. Managing all these computations demands a workflow and scheduling system. In this paper, we identify four indispensable qualities that a Hadoop workflow management system must fulfill, namely Scalability, Security, Multi-tenancy, and Operability. We find that conventional workflow management tools lack at least one of these qualities, and therefore present Apache Oozie, a workflow management system specialized for Hadoop. We discuss the architecture of Oozie, share our production experience over the last few years at Yahoo, and evaluate Oozie's scalability and performance.

Short bio: Mohammad Islam has worked as a Principal Engineer/Technical Lead on Hadoop at Yahoo! for the past three years. He is a committer on the Apache Oozie project. Before that, he worked in the telecom industry for more than a decade. Mohammad Islam has a PhD in Computer Science, specializing in high performance computing and parallel job scheduling.