Hadoop

HDFS stands for the Hadoop Distributed File System. HDFS is the system that allows us to distribute the storage of big data across our cluster of computers, so it makes all of the hard drives on our cluster look like one giant file system. And not only that, it actually maintains redundant copies of that data.
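
As a rough illustration, here's a minimal PySpark sketch of what that looks like from a program's point of view: the file lives somewhere on the cluster, but you address it with a single path, as if it were one file system. It assumes a NameNode listening at hdfs://localhost:9000, and the file path is hypothetical.

```python
# A minimal sketch, assuming PySpark is installed and a NameNode is
# listening at hdfs://localhost:9000; the path below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFSExample").getOrCreate()

# HDFS makes every disk in the cluster look like one file system,
# so a file is addressed by a single path, not by machine.
lines = spark.sparkContext.textFile("hdfs://localhost:9000/user/demo/ratings.csv")
print(lines.count())  # the count is computed in parallel across the cluster

spark.stop()
```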

YARN stands for Yet Another Resource Negotiator. So we talked about the data storage part of Hadoop, and there's also the data processing part of Hadoop; YARN is where the data processing starts to come into play. YARN is basically the system that manages the resources on your computing cluster; it's kind of the heartbeat that keeps your cluster going.
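
As a rough sketch of what "negotiating resources" means in practice, here's a hypothetical PySpark session that asks YARN for a certain number of executors with a certain amount of memory. The specific settings are illustrative only, and it assumes Spark is already configured to talk to a YARN cluster.

```python
# A minimal sketch, assuming Spark is configured against a YARN cluster
# (e.g. HADOOP_CONF_DIR pointing at the cluster's configuration files).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("YarnExample")
    .master("yarn")                           # let YARN negotiate the resources
    .config("spark.executor.instances", "4")  # ask YARN for 4 containers...
    .config("spark.executor.memory", "2g")    # ...with 2 GB of memory each
    .getOrCreate()
)

print(spark.sparkContext.parallelize(range(1000)).sum())
spark.stop()
```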

MapReduce, which again is a piece of Hadoop proper, is at a very high level just a programming metaphor, or programming model, that allows you to process your data across an entire cluster.
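
A minimal, stand-alone Python sketch of that model: a mapper emits key/value pairs and a reducer aggregates everything that shares a key. On a real cluster the same two functions would run in parallel (for example via Hadoop Streaming); this version only illustrates the idea.

```python
# Word count, the classic MapReduce example, in plain Python.
from collections import defaultdict

def mapper(line):
    # Map step: break each input record into (key, value) pairs.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce step: aggregate all values that share the same key.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group the mapper output by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

print(dict(reducer(w, c) for w, c in grouped.items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```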

Pig: if you don't want to write Java or Python MapReduce code and you're more familiar with a scripting language that has sort of a SQL-style syntax, Pig is for you. Pig is a very high-level programming API, and Pig will actually transform that script into something that will run on MapReduce.

Hive is a way of actually taking SQL queries and making the distributed data that's just really sitting on your file system somewhere look like a SQL database. So for all intents and purposes it's just like a database; you can even connect to it through a shell client or ODBC or what have you, and actually execute SQL queries on the data that's stored on your Hadoop cluster, even though it's not really a relational database under the hood. Hive is a very useful API, a very useful interface, for you to use.
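
One way to picture that is a hypothetical PySpark session with Hive support enabled, running plain SQL against a table that is really just files sitting on HDFS. The table and column names here are made up, and this is only one of several ways to issue Hive-style queries.

```python
# A minimal sketch, assuming a Hive metastore is reachable and a table
# named "ratings" already exists (table and column names are hypothetical).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("HiveExample")
    .enableHiveSupport()   # talk to the Hive metastore
    .getOrCreate()
)

# Plain SQL, even though the underlying storage is just files on the cluster.
top_movies = spark.sql("""
    SELECT movie_id, COUNT(*) AS cnt
    FROM ratings
    GROUP BY movie_id
    ORDER BY cnt DESC
    LIMIT 10
""")
top_movies.show()

spark.stop()
```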

Ambari, or Apache Ambari, is basically a thing that sits on top of everything. It gives you a view of your cluster and lets you visualize what's running on your cluster and which systems are using how many resources, and it also has some views in it that allow you to actually do things like execute Hive queries, import databases into Hive, or execute Pig queries, and things like that.

Now, there are other technologies that do this for you, and Ambari is what Hortonworks uses. There are competing distributions of Hadoop stacks out there, Hortonworks being one of them; other ones include Cloudera and MapR, but Hortonworks uses Ambari.

Mesos isn't really part of Hadoop proper; it's basically an alternative to YARN. So it too is a resource negotiator (remember, YARN is "yet another resource negotiator"; Mesos is another one). They basically solve the same problem in different ways, and there are of course pros and cons to using each one that we'll talk about later on. But Mesos is another potential way of managing the resources on your cluster, and there are ways of getting Mesos and YARN to work together if you need to as well.

Spark is one of the most exciting technologies in the Hadoop ecosystem. It requires some programming: you need to actually write your Spark scripts using either Python or Java or the Scala programming language, with Scala being preferred. And it's also very versatile: it can do things like handle SQL queries, it can do machine learning across an entire cluster of information, it can actually handle streaming data in real time, and all sorts of other cool stuff.
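
Here's a minimal sketch of a Spark script in Python (PySpark); the file name and column names are hypothetical. The point is just that a few lines of code express a computation that Spark distributes across the cluster.

```python
# A minimal PySpark sketch: a distributed aggregation over a CSV file
# whose name and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkExample").getOrCreate()

ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Average rating per movie, computed in parallel across the cluster.
averages = (
    ratings.groupBy("movie_id")
    .agg(F.avg("rating").alias("avg_rating"))
    .orderBy(F.desc("avg_rating"))
)
averages.show(10)

spark.stop()
```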

Tez is similar to Spark in that it uses some of the same techniques, notably something called a directed acyclic graph, and this gives Tez a leg up on what MapReduce does because it can produce more optimal plans for actually executing queries. Tez is usually used in conjunction with Hive to accelerate it, so you have an option there: Hive through Tez can often be faster than Hive through MapReduce. Both are different means of optimizing queries to get an efficient answer from your cluster.

HBase is what we call a NoSQL database. It is a columnar data store, and you might have heard that term before; it's basically a really, really fast database meant for very large transaction rates, so it's appropriate, for example, for being hit from a web application or a website doing all kinds of transactions. So HBase can actually expose the data that's stored on your cluster, and maybe that data was transformed in some way by Spark or MapReduce or something else.
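
A hedged sketch of what talking to HBase from Python can look like, using the happybase client; it assumes HBase's Thrift server is running locally and that a table called "users" with a column family "info" already exists (both names are hypothetical).

```python
# A minimal sketch using the happybase client against a local Thrift server.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# Writes and reads are fast keyed lookups, which is what makes HBase
# suitable for very high transaction rates from a web application.
table.put(b"user-42", {b"info:name": b"Ada", b"info:city": b"London"})
print(table.row(b"user-42"))

connection.close()
```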

Apache Storm is basically a way of processing streaming data. So if you have streaming data from, say, sensors or web logs, you can actually process that in real time using Storm; Spark Streaming solves the same problem, Storm just does it in a slightly different way. Apache Storm is made for processing streaming data quickly, in real time, so it doesn't have to be a batch thing anymore. You can actually update your machine learning models or transform data into a database, all in real time, as it comes in.
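
For comparison, here's a minimal sketch of that kind of real-time processing using Spark's Structured Streaming API rather than Storm itself; it assumes a text stream on localhost port 9999 (for example, one started with `nc -lk 9999`).

```python
# A minimal sketch: a running word count over a socket stream,
# updated continuously as data arrives.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StreamingExample").getOrCreate()

# Each line arriving on the socket becomes a row in an unbounded table.
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

counts = (
    lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```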

Oozie is just a way of scheduling jobs on your cluster. If you have a task that needs to happen on your Hadoop cluster that involves many different steps, and maybe many different systems, Oozie is a way of tying all of these things together into jobs that can be run on some sort of schedule. So when you have more complicated operations that require loading data into Hive, then integrating that with Pig, maybe querying it with Spark, and then transforming the results into HBase, Oozie can manage that all for you and make sure that it runs reliably on a consistent basis.

ZooKeeper is basically a technology for coordinating everything on your cluster. So it's the technology that can be used for keeping track of which nodes are up and which nodes are down. It's a very reliable way of keeping track of shared state across your cluster that different applications can use, and many of the applications we've talked about rely on ZooKeeper to maintain reliable and consistent performance across the cluster, even when a node randomly goes down. So ZooKeeper can be used, for example, for keeping track of who the current master node is, or keeping track of who's up, who's down, what have you. And it's really even more extensible than that.
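
A minimal sketch of that idea using the kazoo Python client, assuming a ZooKeeper server on localhost:2181; the paths are hypothetical. The ephemeral node is the trick that lets ZooKeeper notice when a worker goes away.

```python
# A minimal sketch of tracking "who is alive" via ZooKeeper with kazoo.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# An ephemeral node disappears automatically if this process dies,
# which is how "who is up and who is down" gets tracked reliably.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-1", b"alive", ephemeral=True)

print(zk.get_children("/app/workers"))

zk.stop()
```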

Sqoop is a way of actually tying your Hadoop data into a relational database. Anything that can talk to ODBC or JDBC can be loaded by Sqoop into your HDFS file system, so Sqoop is basically a connector between Hadoop and your legacy databases.

Flume is a way of actually transporting web logs, at a very large scale and very reliably, to your cluster. So let's say you have a fleet of web servers: Flume can actually listen to the web logs coming in from those web servers in real time and publish them into your cluster in real time, for processing by something like Storm or Spark Streaming.

Kafka can basically collect data of any sort, from a cluster of PCs, from a cluster of web servers, or whatever it is, and broadcast that into your Hadoop cluster as well. So Sqoop, Flume, and Kafka are all technologies that solve the problem of data ingestion.
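
A minimal sketch of publishing and consuming data with the kafka-python client, assuming a broker on localhost:9092; the topic name and log line are hypothetical.

```python
# A minimal sketch of Kafka as an ingestion pipe: producers publish
# messages to a topic, and consumers (including ones feeding the
# Hadoop cluster) subscribe and receive them as they arrive.
from kafka import KafkaProducer, KafkaConsumer

# A producer on, say, a web server publishes each log line to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("weblogs", b"GET /index.html 200")
producer.flush()

# A consumer reads the same topic; the timeout just ends this demo loop.
consumer = KafkaConsumer(
    "weblogs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)
```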