BIG DATA

Apache Hadoop is an excellent framework for processing, storing and analyzing large volumes of unstructured data - aka Big Data. But getting a handle on all the project's myriad components and sub-components, with names like Pig and Mahout, can be difficult.

Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
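As a rough sketch, the snippet below writes and reads a small file through the HDFS Java FileSystem API; the NameNode address and the /user/demo path are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt");      // illustrative path

      // Write a small file; HDFS splits it into blocks and replicates them across DataNodes.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read it back.
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```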

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
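The classic word-count job illustrates the two phases. The sketch below uses the standard org.apache.hadoop.mapreduce API, with input and output paths taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Map" step: runs on each node against its slice of the input, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // "Reduce" step: aggregates the mapper output to produce the final count per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```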

Hive: Hive is a Hadoop-based data-warehousing framework. It allows users to write queries in a SQL-like language called HiveQL, which are then converted into MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

In Hadoop, the only way to process data used to be through a MapReduce job, and not everyone knows how to write MapReduce programs, whereas most of us are already familiar with SQL. Hive is the tool that takes in SQL queries from users and converts them into MapReduce jobs in the background, so the data is still processed by MapReduce, Hadoop's powerful compute engine.
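As a minimal sketch, a Java program can submit HiveQL through the HiveServer2 JDBC driver; the host, credentials and the clickstream table below are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 endpoint; host, port, database and credentials are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {

      // Plain SQL-like HiveQL; Hive compiles this into MapReduce jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```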

Pig: Pig is a Hadoop-based platform developed by Yahoo, with its own language, Pig Latin. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

As the name suggests, Pig can ingest any kind of data, whether structured, semi-structured or unstructured, and it is usually used to shape (ETL) your data. Example: say you have three different files, one comma-separated, the second tab-separated and the third space-separated; with Pig you can load all three files together and change the separators to a single common one, as sketched below. Pig Latin scripts are themselves turned into a series of MapReduce jobs.
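A rough sketch of that scenario, driving Pig Latin from Java through the PigServer API; the file paths and the two-column schema are assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class UnifySeparators {
  public static void main(String[] args) throws Exception {
    // Run against the cluster; use ExecType.LOCAL to test on local files.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Load the three differently delimited files (paths and fields are illustrative).
    pig.registerQuery("a = LOAD '/data/file_csv' USING PigStorage(',') AS (id:int, name:chararray);");
    pig.registerQuery("b = LOAD '/data/file_tsv' USING PigStorage('\\t') AS (id:int, name:chararray);");
    pig.registerQuery("c = LOAD '/data/file_ssv' USING PigStorage(' ') AS (id:int, name:chararray);");
    pig.registerQuery("all_rows = UNION a, b, c;");

    // Write everything back out with a single, common separator (tab in this case).
    pig.store("all_rows", "/data/unified", "PigStorage('\\t')");
  }
}
```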

HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
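A minimal sketch of an insert and a low-latency point lookup with the standard HBase Java client, assuming a pre-created "users" table with an "info" column family.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {  // assumes the table exists

      // Insert (or update) a single row keyed by user id.
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("user42@example.com"));
      table.put(put);

      // Low-latency point lookup by row key.
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}
```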

Flume: Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

Flume is a very powerful and fault-tolerant tool for aggregating data from various sources into Hadoop. Sources can be websites where you have click-stream data, log aggregation, or anything really.
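A minimal sketch of the application side, pushing one event to a Flume agent's Avro source with the Flume client SDK. The agent hostname, port and event body are placeholders, and the agent's own source/channel/sink configuration (which actually lands the data in HDFS) is defined separately.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSender {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to a Flume agent whose Avro source listens on this host/port (placeholders).
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
    try {
      // Each click-stream or log line becomes one Flume event.
      Event event = EventBuilder.withBody("GET /index.html 200", StandardCharsets.UTF_8);
      client.append(event);   // the agent's channel and sink move it on towards HDFS
    } finally {
      client.close();
    }
  }
}
```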

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the previous jobs on which it relies for data have completed.
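A minimal sketch of submitting such a workflow from Java with the Oozie client API, assuming a workflow.xml (chaining Pig, Hive and MapReduce actions) has already been uploaded to HDFS; the server URL and paths are placeholders, and the nameNode/jobTracker properties are whatever that workflow.xml expects.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    // Oozie server URL is a placeholder.
    OozieClient oozie = new OozieClient("http://oozie-server:11000/oozie");

    // Point at the workflow application directory on HDFS.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflows/etl");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "resourcemanager:8032");

    String jobId = oozie.run(conf);               // submit and start the workflow
    WorkflowJob job = oozie.getJobInfo(jobId);    // poll for status
    System.out.println(jobId + " is " + job.getStatus());
  }
}
```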

Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.
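Mahout also exposes its algorithms as a plain Java library. A minimal sketch of its collaborative-filtering ("Taste") API, which runs on a single machine rather than as a MapReduce job, assuming a ratings.csv of userID,itemID,preference rows:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds "userID,itemID,preference" lines (file name is illustrative).
    DataModel model = new FileDataModel(new File("ratings.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1.
    List<RecommendedItem> recs = recommender.recommend(1, 3);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " scored " + item.getValue());
    }
  }
}
```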

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Sqoop is the tool used to import RDBMS data into Hadoop, usually to prepare a data lake on HDFS. All it needs to bring data into HDFS is your database connection URL, driver, username, password and a few more simple parameters, which are easy to pass, as in the sketch below.
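A rough sketch of kicking off such an import programmatically with Sqoop.runTool, which accepts the same arguments as the sqoop command line; the JDBC URL, credentials and table are placeholders.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Same flags as the "sqoop import" command line; connection details are placeholders.
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",
        "--driver", "com.mysql.jdbc.Driver",
        "--username", "etl_user",
        "--password", "secret",
        "--table", "orders",
        "--target-dir", "/user/demo/lake/orders",   // where the data lands in HDFS
        "--num-mappers", "4"                        // parallel map tasks doing the copy
    };
    int exitCode = Sqoop.runTool(importArgs);        // returns 0 on success
    System.exit(exitCode);
  }
}
```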

http://hadooptutorial.info/