Workshop on Dynamic Distributed Data-Intensive Applications, Programming Abstractions, and Systems (3DAPAS)

To be held in conjunction with HPDC-2011, 8 June 2011, San Jose, CA

Program to be held jointly with 2nd International Emerging Computational Methods for the Life Sciences Workshop (ECMLS)

Schedule:

8:10 - 8:55 – Joint 3DAPAS and ECMLS Keynote (room Willow Glenn 3)
  • "Cloud Computing and the DNA Data Race," Michael Schatz (Cold Spring Harbor Laboratory) - slides
8:55 - 10:10 – Parallel paper tracks

 3DAPAS - room Willow Glenn 2
  • "Cosmic Microwave Background Data Analysis at the Peta-scale and Beyond," (invited paper) Julian Borrill (Lawrence Berkeley National Laboratory & University of California at Berkeley) - slides
  • "Adaptive, Secure, and Scalable Distributed Data Outsourcing: A Vision Paper," Li Xiong, Slawomir Goryczka, Vaidy Sunderam (Emory University) - slides
  • "Towards Jungle Computing with Ibis/Constellation," Jason Maassen, Niels Drost, Henri E. Bal, Frank J. Seinstra (Vrije Universiteit) - slides
 ECMLS - room Willow Glenn 3
  • "A Parallel Random Forest Classifier for R," Lawrence Mitchell, Terence M. Sloan, Muriel Mewissen, Peter Ghazal, Thorsten Forster, Michal Piotrowski, Arthur Trew (University of Edinburgh)
  • "Adapting Bioinformatics Applications for Heterogeneous Systems: a Case Study," Irena Lanc, Peter Bui, Douglas Thain, Scott Emrich (Northwestern University)
  • "Characterizing Deep Sequencing Analytics Using BFAST: Towards a Scalable Distributed Architecture for Next-Generation Sequencing Data," Joohyun Kim, Sharath Maddineni, Shantenu Jha (Louisiana State University)
10:10 - 10:20 – Break

10:20 - 10:45 – Parallel paper tracks

 ECMLS - room Willow Glenn 2
  • "A Hierarchical Framework for Cross-Domain MapReduce Execution," Yuan Luo, Zhenhua Guo, Yiming Sun, Beth Plale, Judy Qiu, Wilfred Li (Indiana University)
 ECMLS - room Willow Glenn 3
  • "High-Throughput Virtual Molecular Docking: Hadoop Implementation of AutoDock4 on a Private Cloud," Sally Ellingson, Jerome Baudry (University of Tennessee)
10:45 - 11:25 – Joint 3DAPAS and ECMLS Panel (room Willow Glenn 3)
  • Dynamic Distributed Data Intensive Analysis Environments (for Life Sciences):
    Architecture and Middleware: HPC v Grid v Cloud ... and Lustre v SciDB v NOSQL ...
  • Moderator: Geoffrey Fox, Indiana University



3DAPAS Abstracts:

"Cloud Computing and the DNA Data Race,"
Michael Schatz (Cold Spring Harbor Laboratory)

In the race between DNA sequencing throughput and computer speed, sequencing is winning by a mile. Sequencing throughput is currently around 200 to 300 billion bases per run on a single sequencing machine, and is improving at a rate of about fivefold per year. In comparison, computer performance generally follows 'Moore’s Law', doubling only every 18 to 24 months. As the gap in performance widens, the question of how to design higher-throughput analysis pipelines becomes crucial. One option is to enhance and refine the algorithms to make better use of a fixed amount of computing power. Unfortunately, algorithmic breakthroughs of this kind, like scientific breakthroughs, are difficult to plan or foresee. The most practical option is to develop methods that make better use of multiple computers and processors in parallel. This presentation will describe some of my recent work using the distributed programming environment Hadoop/MapReduce in conjunction with cloud computing to dramatically accelerate several important computations in genomics, including short read mapping and genotyping, sequencing error correction, and de novo assembly of large genomes.
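
As a rough illustration of the gap described in this abstract, the sketch below compares the two growth rates it quotes: about fivefold per year for sequencing, and a doubling every 18 or 24 months for computing. The function name `throughput_gap` and its parameters are hypothetical, and the model assumes both rates stay constant, which the abstract does not claim.

```python
def throughput_gap(years, seq_growth_per_year=5.0, doubling_months=18.0):
    """Ratio of sequencing growth to compute growth after `years`.

    Assumes constant exponential growth for both (an idealization).
    """
    sequencing = seq_growth_per_year ** years          # fivefold per year
    compute = 2.0 ** (years * 12.0 / doubling_months)  # one doubling per period
    return sequencing / compute


if __name__ == "__main__":
    for doubling_months in (18.0, 24.0):
        for years in (1, 3, 5):
            gap = throughput_gap(years, 5.0, doubling_months)
            print(f"doubling every {doubling_months:.0f} months, "
                  f"after {years} years: gap x{gap:.1f}")
```

Under these assumptions, three years of growth with an 18-month doubling period leaves sequencing ahead by a factor of 125 / 4 = 31.25.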


"Cosmic Microwave Background Data Analysis at the Peta-scale and Beyond," Julian Borrill (Lawrence Berkeley National Laboratory & University of California at Berkeley)

The analysis of Cosmic Microwave Background (CMB) data is an ongoing high performance computing challenge. For more than a decade now the size of CMB data sets has tracked Moore's Law, and we expect this to continue for at least the next 15 years. In this talk we will review the work done to date to follow this scaling, and discuss the steps we are taking to continue to do so to the peta-scale and beyond.


"Adaptive, Secure, and Scalable Distributed Data Outsourcing: A Vision Paper," Li Xiong, Slawomir Goryczka, Vaidy Sunderam (Emory University)


The growing trend towards grid computing and cloud computing provides enormous potential for enabling dynamic, distributed and data-intensive applications such as sharing and processing of large-scale scientific data. It also creates an increasing challenge for automatically and dynamically placing the data in globally distributed computers or data centers in order to optimally utilize resources while minimizing user-perceived latency. This challenge is further complicated by the security and privacy constraints on data that are potentially sensitive. In this paper, we present our vision of an adaptive, secure, and scalable data outsourcing framework for storing and processing massive, dynamic, and potentially sensitive data using distributed resources. We identify the main technical challenges and present some preliminary solutions. The key idea of the framework is that it uniquely combines data partitioning, encryption, and data reduction to ensure data confidentiality and privacy while minimizing the cost for data shipping and computation. We believe the framework will provide a holistic conceptual foundation for secure data outsourcing that enables dynamic, distributed, and data-intensive applications and will open up many exciting research challenges.


"Towards Jungle Computing with Ibis/Constellation," Jason Maassen, Niels Drost, Henri E. Bal, Frank J. Seinstra (Vrije Universiteit)

The high-performance computing landscape is becoming more and more complex. Besides traditional supercomputers and clusters, scientists can also apply grid and cloud infrastructures. Moreover, the current integration of many-core technologies such as GPUs with such infrastructures adds to the complexity. To make matters worse, hardware availability, software heterogeneity, data distribution, and the need for scalability commonly force scientists to use multiple computing platforms simultaneously: a true computing jungle.

In this paper we introduce Ibis/Constellation, a lightweight software platform specifically designed for distributed, heterogeneous and hierarchical computing environments. In Ibis/Constellation we assume that applications consist of several distinct (but somehow related) activities. These activities can be implemented independently using existing, well-understood tools (e.g. MPI, CUDA, etc.). Ibis/Constellation is then used to construct the overall application by coupling the distinct activities. Using application-defined labels in combination with context-aware work stealing, Ibis/Constellation provides a simple and efficient mechanism for automatically mapping the activities to the appropriate hardware, taking heterogeneity and locality into account.

We show that an existing supernova detection application can be ported to Ibis/Constellation with little effort. By making small changes to the application-defined labels, this application can run efficiently in three very different HPC computing environments: a distributed set of clusters, a large 48-core machine, and a GPU cluster.
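
The general idea of matching labeled activities to suitable hardware through work stealing can be sketched as a toy simulation. This is not the Ibis/Constellation API (that platform is not shown here); the names `Activity`, `Worker`, `steal`, and `run`, and the labels "cpu" and "gpu", are all hypothetical. The sketch assumes each worker advertises a set of labels it can execute, prefers activities in its own queue, and otherwise steals only activities whose label matches its context.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Activity:
    name: str
    label: str  # application-defined label, e.g. "cpu" or "gpu"


@dataclass
class Worker:
    name: str
    context: frozenset  # labels this worker is able to run
    queue: deque = field(default_factory=deque)

    def next_local(self):
        """Take the first local activity this worker can run, if any."""
        for i, act in enumerate(self.queue):
            if act.label in self.context:
                del self.queue[i]
                return act
        return None


def steal(thief, workers):
    """Steal a matching activity from the tail of another worker's queue."""
    for victim in workers:
        if victim is thief:
            continue
        for i in range(len(victim.queue) - 1, -1, -1):
            if victim.queue[i].label in thief.context:
                act = victim.queue[i]
                del victim.queue[i]
                return act
    return None


def run(workers):
    """Round-robin simulation; returns (worker name, activity name) pairs."""
    schedule = []
    progress = True
    while progress:
        progress = False
        for w in workers:
            act = w.next_local() or steal(w, workers)
            if act is not None:
                schedule.append((w.name, act.name))
                progress = True
    return schedule


if __name__ == "__main__":
    cpu = Worker("cpu-node", frozenset({"cpu"}))
    gpu = Worker("gpu-node", frozenset({"gpu"}))
    # All work starts on the CPU node; the GPU node must steal what it can run.
    cpu.queue.extend([Activity("detect", "gpu"),
                      Activity("merge", "cpu"),
                      Activity("filter", "gpu")])
    for worker_name, activity_name in run([cpu, gpu]):
        print(worker_name, "ran", activity_name)
```

In this toy model, changing an activity's label is enough to redirect it to different hardware, which loosely mirrors the abstract's point about porting by adjusting labels. Checking the local queue before stealing is a simple stand-in for locality.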