Data Parallel Programming Paradigm

  1. Map/Reduce
    • The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
    • The Hadoop project is a free open source Java MapReduce implementation.
    • Twister is an open source Java MapReduce implementation that supports iterative MapReduce computations efficiently.
    • Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
    • Aster Data Systems nCluster In-Database MapReduce supports Java, C, C++, Perl, and Python algorithms integrated into ANSI SQL.
    • GridGain is a free open source Java MapReduce implementation.
    • Phoenix is a shared-memory implementation of MapReduce implemented in C.
    • FileMap is an open version of the framework that operates on files using existing file-processing tools rather than tuples.
    • MapReduce has also been implemented for the Cell Broadband Engine, also in C.
    • Mars is a MapReduce framework implemented on NVIDIA GPUs (graphics processors) using CUDA.
    • Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
    • CouchDB uses a MapReduce framework for defining views over distributed documents and is implemented in Erlang.
    • Skynet is an open source Ruby implementation of Google’s MapReduce framework.
    • Disco is an open source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
    • Misco is an open source MapReduce designed for mobile devices and is implemented in Python.
    • Qizmt is an open source MapReduce framework from MySpace written in C#.
    • Hive is an open-source framework from Facebook that provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine.
    • Holumbus-MapReduce, part of the Holumbus Framework: distributed computing with MapReduce in Haskell.
    • BashReduce: MapReduce implemented as a Bash script by Erik Frey of Last.fm.
    • MapReduce for Go
    • Meguro - a JavaScript MapReduce framework
    • MongoDB is a scalable, high-performance, open source, schema-free, document-oriented database written in C++ that features MapReduce.
    • mapReduce provides an R-like implementation that demonstrates the simplicity of the mapReduce pattern in a functional programming language.
    • RHIPE integrates the R statistics language environment with Hadoop and makes it possible to code map-reduce algorithms in R.
    • Parallel::MapReduce is a CPAN module providing experimental MapReduce functionality for Perl.
    • MapReduce on volunteer computing
    • Secure MapReduce
    • MapReduce implemented in MPI
      • MapReduce with MPI implementation from Sandia: No fault tolerance or data redundancy
      • MapReduce implementation using MPI from IU
      • T. Tu et al. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008.
        • The paper describes how MapReduce can be implemented on top of the ubiquitous distributed-memory MPI, and how the intermediate data-shuffle operation is conceptually identical to the familiar MPI Alltoall operation.
  2. Hadoop  
      • Hadoop Common: The common utilities that support the other Hadoop subprojects.
      • Avro: A data serialization system that provides dynamic integration with scripting languages.
      • Chukwa: A data collection system for monitoring large distributed systems.
      • HBase: A scalable, distributed database that supports structured data storage for large tables.
      • HDFS: A distributed file system that provides high throughput access to application data.
      • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
      • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
      • Pig: A high-level data-flow language and execution framework for parallel computation.
      • ZooKeeper: A high-performance coordination service for distributed applications.
      • Hama: a distributed scientific package on Hadoop for massive matrix and graph data
      • Oozie: Hadoop workflow
      • Sqoop: import data from relational databases into Hadoop
      • HadoopDB
      • BigTable [Paper][Video]
      • HyperTable 
      • Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. 
      • HAMAKE: data flow instructions (fold and foreach)
      • MRSim: a Hadoop simulator
      • Mumak: Map-Reduce Simulator, ppt
      • MRPerf: A Simulator for MapReduce
      • myHadoop: Hadoop on HPC clusters
      • Hadoop on Demand
      • Hadoop online Prototype, paper
      • Hadoop workflow survey
    1. Microsoft data parallel programming
    2. All-Pairs
    3. Sector/Sphere
    4. Mortar: Wide-Scale Stream Processing
    5. PACT
    6. Frenetic: a network programming language
    7. PADS: processing ad hoc data sources
    8. Sawzall
    9. A list of Key-Value stores
    10. Data intensive workflow
      1. Pwrake and G-Farm
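
The Tu et al. entry under "MapReduce implemented in MPI" notes that the intermediate data shuffle is conceptually identical to an MPI Alltoall exchange. The following is a minimal single-process sketch of that idea, not code from any of the listed projects: workers are simulated as list indices rather than MPI ranks, and all function names (map_phase, partition, all_to_all, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real MPI program the all_to_all step would presumably be replaced by a collective such as MPI_Alltoall (or MPI_Alltoallv when bucket sizes differ).

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit (word, 1) for every word in this worker's chunk of lines."""
    return [(word, 1) for line in chunk for word in line.split()]


def partition(pairs, n_workers):
    """Split one worker's map output into one outgoing bucket per destination."""
    buckets = [[] for _ in range(n_workers)]
    for key, value in pairs:
        # The same key always hashes to the same destination worker.
        buckets[hash(key) % n_workers].append((key, value))
    return buckets


def all_to_all(send):
    """Simulated all-to-all: worker j receives send[i][j] from every worker i."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


def reduce_phase(received):
    """Reduce: sum the counts for every key this worker received."""
    totals = defaultdict(int)
    for bucket in received:
        for key, value in bucket:
            totals[key] += value
    return dict(totals)


def word_count(chunks):
    """Run map, shuffle (all-to-all), and reduce over one chunk per worker."""
    n_workers = len(chunks)
    send = [partition(map_phase(chunk), n_workers) for chunk in chunks]
    recv = all_to_all(send)  # the shuffle step
    result = {}
    for received in recv:
        # Each key lives on exactly one worker, so the partial results are disjoint.
        result.update(reduce_phase(received))
    return result


if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"], ["a"]]))
```

Each simulated worker maps its own chunk, sorts the output into one bucket per destination, and the transpose in all_to_all plays the role of the shuffle; because every key is routed to a single worker, the per-worker reductions can be merged without conflicts.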