What is Hadoop?
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. (Wikipedia)
Apache Hadoop's three core components
- HDFS - a general-purpose distributed file system that can store files of any format at any scale. Fault tolerance through redundancy is built in.
- YARN - a cluster resource manager that allocates resources such as RAM, CPU, disk, and network I/O.
- MapReduce Framework - a set of APIs that allows developers to run code on the cluster in parallel.
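The MapReduce programming model above can be sketched in plain Python, with no cluster required. This is a minimal single-process simulation of a word-count job; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not part of any Hadoop API, and on a real cluster the shuffle is performed by the framework across machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (the framework does this
    # across the network on a real cluster).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

Because map and reduce operate on independent keys, both phases parallelize naturally: each mapper processes its own slice of the input, and each reducer handles a disjoint set of keys.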
Hadoop's rich ecosystem encompasses dozens of tools for various purposes; below are a few examples. These products are built with fault tolerance and scalability in mind.
- MapReduce: a framework for general-purpose cluster computing
- Hive: allows users to define structure over data on HDFS and query it through SQL-like statements, letting traditional BI tools build analytics on HDFS data
- Pig: driven by the scripting language Pig Latin, it allows developers to write ETL jobs on HDFS
- HBase: a wide-column distributed key-value database that allows high-throughput, consistent reads and writes (millions of reads/writes per second)
- Spark: a more recent and much faster framework for general-purpose computing. It offers a unified framework for batch processing, live stream processing, graph processing, and machine learning, and includes a SQL:2003-compliant query engine
- Kafka: a data ingestion layer that collects live streams of data from thousands of sources, aggregates them, and channels them to one or more destinations, such as HDFS for storage or Spark Streaming for later processing
- Kerberos: the authentication gatekeeper for a Hadoop cluster
- Sentry: fine-grained, role-based authorization for data and metadata stored on a Hadoop cluster
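To make Kafka's role concrete, here is a hypothetical in-process sketch of its core abstraction: an append-only log per topic, with each consumer group tracking its own read offset. The `Topic` class and its methods are invented for illustration; a real Kafka topic is a partitioned, replicated, durable log served by a broker cluster.

```python
class Topic:
    """Toy stand-in for a Kafka topic: an append-only log plus
    per-consumer-group offsets (not a real Kafka client API)."""

    def __init__(self):
        self.log = []       # append-only log of records
        self.offsets = {}   # consumer-group name -> next offset to read

    def produce(self, record):
        # Producers only ever append; existing records are immutable.
        self.log.append(record)

    def consume(self, group, max_records=10):
        # Each group reads from its own committed offset, so multiple
        # destinations (e.g. an HDFS sink and a Spark Streaming job)
        # can independently consume the same stream.
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)  # commit new offset
        return batch

topic = Topic()
for event in ["click:home", "click:cart", "purchase:42"]:
    topic.produce(event)

batch_spark = topic.consume("spark-streaming")
batch_hdfs = topic.consume("hdfs-sink")  # same data, independent offset
```

The design choice worth noting is that consumption does not delete data: the log stays intact, and "progress" is just an offset per group, which is what lets Kafka fan the same stream out to many destinations.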
Key Use Cases
- Cheaper data storage solution
- Batch data processing
- BI Analytics
- Processing of live data streams
- Machine learning
- Fast data read/write
The key features across all use cases are high scalability and fault tolerance.
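The fault tolerance through redundancy mentioned for HDFS above can be sketched as splitting a file into fixed-size blocks and placing multiple replicas of each block on distinct nodes. This is a simplified illustration assuming round-robin placement and a 3-way replication factor (the HDFS default); real HDFS placement is rack-aware, and its default block size is 128 MB, not the tiny size used here.

```python
import itertools

REPLICATION = 3   # HDFS default replication factor
BLOCK_SIZE = 8    # bytes, tiny for illustration (HDFS default: 128 MB)

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Chop the file into fixed-size blocks; the last block may be short.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` consecutive nodes round-robin,
    # so every block lives on `replication` distinct machines.
    ring = itertools.cycle(range(len(nodes)))
    return {block_id: [nodes[next(ring)] for _ in range(replication)]
            for block_id, _ in enumerate(blocks)}

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"hello hadoop, hello hdfs!")
placement = place_replicas(blocks, nodes)
# Losing any single node still leaves at least 2 copies of every block,
# which is why hardware failure can be handled by the framework.
```

Scalability follows from the same layout: adding nodes adds both storage capacity and places for new replicas, with no change to the files already stored.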