BIG DATA
MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
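To make the two phases concrete, here is a minimal single-process sketch of the classic word-count job in Python. In a real Hadoop job the framework distributes the map and reduce tasks across the cluster's nodes and handles the shuffle for you; the function names below are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

# Map phase: each mapper processes one chunk of input and emits
# (key, value) pairs -- here, (word, 1) for every word it sees.
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group all values emitted for the same key together.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: aggregate the grouped values into the final answer.
def reduce_phase(key, values):
    return key, sum(values)

chunks = ["big data big", "data big"]  # stand-ins for HDFS blocks
mapped = [pair for c in chunks for pair in map_phase(c)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 3, 'data': 2}
```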
Hive: Hive is a Hadoop-based data-warehousing framework. It allows users to write queries in a SQL-like language called HiveQL, which are then converted into MapReduce jobs. This lets SQL programmers with no MapReduce experience use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
Originally, the only way to process data in Hadoop was through a MapReduce job, and not everyone knows how to write MapReduce programs, whereas most of us are already familiar with SQL. So Hive is a tool that takes SQL queries from users and converts them into MapReduce jobs in the background, since MapReduce is the engine that actually processes the data on Hadoop.
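As a rough intuition for that translation, here is a toy Python sketch of how a SQL aggregate like `SELECT region, SUM(amount) FROM sales GROUP BY region` maps onto the map and reduce phases. The table and column names are made up; real Hive generates far more elaborate job plans.

```python
from collections import defaultdict

# Rows of a hypothetical "sales" table: (region, amount).
rows = [("east", 100), ("west", 50), ("east", 25)]

# The query  SELECT region, SUM(amount) FROM sales GROUP BY region
# becomes, roughly, one MapReduce job:
#   map:    emit (region, amount) for each row
#   reduce: sum the amounts grouped under each region
grouped = defaultdict(int)
for region, amount in rows:   # map + shuffle by key
    grouped[region] += amount # reduce: running sum per key
print(dict(grouped))  # {'east': 125, 'west': 50}
```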
Pig: Pig is a Hadoop-based platform developed at Yahoo, with its own language, Pig Latin. It is relatively easy to learn and is adept at very deep, very long data pipelines (an area where SQL is limited).
As the name suggests, Pig can ingest any kind of data, whether structured, unstructured, or semi-structured. Pig is usually used for ETL, i.e., to shape your data. For example, say you have three files: one comma-separated, the second tab-separated, and the third space-separated. With Pig you can load all three files together and convert them to a common separator. A Pig Latin script is itself translated into a series of MapReduce jobs.
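That separator-normalization example can be sketched in plain Python to show the shape of the transformation. The file names and contents below are invented stand-ins for the three input files; a real Pig script would do this with `LOAD ... USING PigStorage(',')` and run at cluster scale.

```python
import csv
import io

# Three hypothetical input files with different separators.
files = {
    "a.txt": ("1,alice,30\n", ","),
    "b.txt": ("2\tbob\t25\n", "\t"),
    "c.txt": ("3 carol 41\n", " "),
}

# Normalize everything to one common separator (a tab) -- the kind of
# reshaping a short Pig Latin LOAD/STORE script performs.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
for content, sep in files.values():
    for line in content.splitlines():
        writer.writerow(line.split(sep))
print(out.getvalue())
```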
HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to perform updates, inserts, and deletes. eBay and Facebook use HBase heavily.
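To illustrate the data model behind those quick lookups, here is a toy in-memory sketch: each row key maps to a dictionary of columns, so point reads, writes, and deletes by row key are constant-time. The class and column names are purely illustrative and are not the HBase API.

```python
# A toy sketch of HBase's row-key -> {column: value} model.
class ToyHBaseTable:
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):  # insert or update a cell
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):                 # low-latency point lookup
        return self.rows.get(row_key, {})

    def delete(self, row_key):              # remove a whole row
        self.rows.pop(row_key, None)

t = ToyHBaseTable()
t.put("user1", "info:name", "Ada")
t.put("user1", "info:city", "London")
print(t.get("user1"))  # {'info:name': 'Ada', 'info:city': 'London'}
t.delete("user1")
print(t.get("user1"))  # {}
```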
Flume: Flume is a powerful, fault-tolerant tool used to collect and aggregate data from various sources into Hadoop. Sources can be websites producing clickstream data, application logs, or nearly anything else.
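The core flow Flume implements is source, channel, sink: events from several sources are buffered in a channel and drained into a single destination. Here is a toy single-process sketch of that pipeline; the source names are invented, and a list stands in for HDFS.

```python
from queue import Queue

channel = Queue()  # buffers events between sources and the sink
sink = []          # stands in for an HDFS directory

# Hypothetical sources emitting events.
sources = {
    "webserver": ["GET /home", "GET /cart"],
    "applog": ["WARN low disk"],
}

for name, events in sources.items():  # each source pushes its events
    for event in events:
        channel.put((name, event))

while not channel.empty():            # the sink drains the channel
    sink.append(channel.get())

print(len(sink))  # 3
```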
Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig, and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the previous jobs it relies on for data have completed.
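The dependency rule described above can be sketched as a tiny scheduler: a job runs only once everything it depends on has completed. The job names below are hypothetical examples of actions an Oozie workflow might chain; real Oozie workflows are defined in XML.

```python
# Each job lists the jobs it must wait for (a small dependency graph).
deps = {
    "import": [],          # e.g. a Sqoop action
    "clean": ["import"],   # e.g. a Pig action
    "report": ["clean"],   # e.g. a Hive query
}

completed, order = set(), []
while len(order) < len(deps):
    for job, needs in deps.items():
        # Run a job only after all of its dependencies have completed.
        if job not in completed and all(n in completed for n in needs):
            order.append(job)
            completed.add(job)
print(order)  # ['import', 'clean', 'report']
```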
Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to that target.
Sqoop is commonly used to import RDBMS data into Hadoop, for example when building a data lake on HDFS. All it needs is your database connection URL, driver, username, password, and a few other simple parameters that are easy to pass.
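As a sketch of what those parameters look like in practice, the snippet below assembles an illustrative `sqoop import` command line. The host, database, table, credentials, and target directory are placeholder values, not a real deployment.

```python
# Illustrative only: the typical arguments a Sqoop import takes
# (connection URL, JDBC driver, credentials, source table, HDFS target).
args = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # database connection URL
    "--driver", "com.mysql.jdbc.Driver",        # JDBC driver class
    "--username", "etl_user",
    "--password", "secret",
    "--table", "orders",                        # source RDBMS table
    "--target-dir", "/data/lake/orders",        # destination in HDFS
]
command = " ".join(args)
print(command)
```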
http://hadooptutorial.info/