Post date: Sep 20, 2014 7:54:12 AM
Hadoop HDFS, a distributed file system, is the underpinning of many Big Data solutions.
Hadoop MapReduce/YARN provides a distributed data-processing engine and an API for implementing flow-based transform functions on data files stored in Hadoop HDFS.
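As a rough illustration of the map/shuffle/reduce flow described above, here is a minimal single-process sketch in plain Python (a hypothetical stand-in: real Hadoop runs these phases distributed across a cluster, reading HDFS blocks):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs for each word in one input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts emitted for one word.
    return key, sum(values)

lines = ["big data big ideas", "big data tools"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

The key point is that map and reduce are pure functions over key/value pairs, which is what lets the framework parallelize them freely.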
Many tools have been developed to make MapReduce/YARN easier to use, such as Pig (Pig Latin) and Hive (a SQL subset).
Other tools, such as HBase and Cassandra, bypass MapReduce/YARN altogether.
A new wave of tools addresses HDFS latency by moving data into memory.
Tachyon provides a caching layer on top of Hadoop HDFS, reducing access latency.
Spark provides an API to a data-processing engine that uses in-memory data caching to speed up processing.
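The caching idea can be sketched in plain Python (a hypothetical `CachedDataset` class, not Spark's actual API: in Spark the equivalent is marking an RDD as cached so later jobs reuse the in-memory copy instead of re-reading HDFS):

```python
class CachedDataset:
    """Materialize a dataset once, then serve later jobs from memory."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # simulates an expensive read from HDFS
        self._cache = None
        self.loads = 0            # counts how often the "disk" is hit

    def collect(self):
        if self._cache is None:   # first access: load and keep in memory
            self.loads += 1
            self._cache = self._load_fn()
        return self._cache

dataset = CachedDataset(lambda: list(range(5)))
total = sum(dataset.collect())        # first job: loads the data
maximum = max(dataset.collect())      # second job: served from memory
print(total, maximum, dataset.loads)  # 10 4 1
```

This reuse across successive computations is what makes iterative workloads (the case MapReduce handles poorly, since each job re-reads its input from disk) so much faster.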
Spark provides further extensions such as stream processing.
Shark sits on top of Spark and provides a SQL interface to the Spark API.
Storm is a tool similar to Spark but uses data-flow techniques.
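The data-flow contrast can be sketched as follows (a hypothetical single-process stand-in for Storm's model, where records flow one at a time through a chain of processing steps rather than being collected into a batch first):

```python
def split_step(sentence):
    # First processing step: split an arriving record into words.
    for word in sentence.split():
        yield word

def count_step(words, counts):
    # Second processing step: update running counts as words flow in.
    for word in words:
        counts[word] = counts.get(word, 0) + 1

counts = {}
stream = ["storm processes tuples", "tuples flow continuously"]  # stand-in source
for sentence in stream:              # each record is handled as it arrives
    count_step(split_step(sentence), counts)
print(counts)  # {'storm': 1, 'processes': 1, 'tuples': 2, 'flow': 1, 'continuously': 1}
```

State like `counts` is updated continuously as records arrive, rather than produced at the end of a batch job.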
http://lambda-architecture.net/