Apache Hadoop is an open-source framework focused on reliable, scalable, and distributed computing, processing large data sets across clusters of computers using simple programming models. It is composed of four main modules: Hadoop Common, HDFS, YARN, and MapReduce. This post is about HDFS, the distributed file system that provides high-throughput access to application data; it is where all the information is stored. HDFS is also part of the Data Lake concept, but that is a subject for another post.
Just to illustrate, here is an overview of how the Apache Hadoop ecosystem is organized. Note that HDFS is the base layer for many other frameworks.
Source: https://phoenixnap.com/kb/apache-hadoop-architecture-explained
Before processing data on Hadoop, the files need to be put into HDFS; this step is called Data Ingestion. Despite being simple, it is one of the most important steps of the whole process. There are many ways to do it: Python, Kafka, Flume, shell scripts, Sqoop, or ETL tools like Talend, and the last one is what will be discussed here: how to do data ingestion with Talend Open Studio for Big Data using the tHDFSPut component.
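Before getting to Talend, it helps to see what a put into HDFS boils down to. Here is a minimal Java sketch using the plain Hadoop FileSystem client, which is roughly what any of these tools do under the hood; the NameNode address, username, and paths are hypothetical placeholders, not values from this tutorial.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsIngestExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; host, port, and user are hypothetical.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "hadoop");

        // Copy a local file into HDFS, overwriting the target if it already exists.
        fs.copyFromLocalFile(false,                                   // keep the local source
                             true,                                    // overwrite on HDFS
                             new Path("/tmp/local/sales.csv"),        // hypothetical local file
                             new Path("/user/hadoop/raw/sales.csv")); // hypothetical HDFS target
        fs.close();
    }
}
```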
Talend is an open-source ETL tool based on Java. Depending on what needs to be done, there are different Talend products. Since this demonstration runs on Hadoop, the tool chosen was Talend Open Studio for Big Data.
Procedure:
1- Right-click on Hadoop Cluster to create a Hadoop connection
2- Fill in the information
3- Choose your Hadoop distribution and YARN version, and select the "Enter manually Hadoop services" option
4- Enter your server address and Hadoop username
5- Right-click on the Hadoop cluster connection that was just created and choose Create HDFS
6- Fill in the information
7- Choose how you want the files to be handled on HDFS, with a header for example
8- The preparation step is done!
9- Create a new Job to execute the data ingestion
10- Find the tHDFSPut component
11- Property Type: choose the HDFS connection created earlier, then set the Local directory, the HDFS directory, and set Overwrite file to "always"
12- Choose which files you want to copy. In my case, all of them. You can also put a file name or pattern in the Filemask option; in that case, only the matching files will be copied (see the sketch after this list)
13- The job step is done!
14- Now, run the job,
15- And be happy!
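For reference, here is a rough Java sketch of what a job like this does: connect to HDFS, select the local files that match a filemask, and copy each one with overwrite enabled. The directories, mask, NameNode address, and username below are hypothetical, and the mask is matched with a simple regex here rather than the glob-style mask the component uses.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.net.URI;

public class HdfsPutJobSketch {
    public static void main(String[] args) throws Exception {
        String localDir = "/tmp/local/input";   // hypothetical local directory
        String hdfsDir  = "/user/hadoop/raw";   // hypothetical HDFS directory
        String filemask = ".*\\.csv";           // hypothetical mask: only CSV files (regex for illustration)

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "hadoop")) {
            fs.mkdirs(new Path(hdfsDir));       // make sure the target directory exists

            File[] files = new File(localDir).listFiles();
            if (files == null) return;
            for (File f : files) {
                // Copy only regular files whose names match the filemask
                if (f.isFile() && f.getName().matches(filemask)) {
                    fs.copyFromLocalFile(false,  // keep the local source
                                         true,   // "Overwrite file: always"
                                         new Path(f.getAbsolutePath()),
                                         new Path(hdfsDir, f.getName()));
                    System.out.println("Copied " + f.getName() + " to " + hdfsDir);
                }
            }
        }
    }
}
```

The try-with-resources block simply guarantees the HDFS connection is closed even if one of the copies fails.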
Once the files are available in HDFS, they can be used by different frameworks in different ways: the data can be organized into a dimensional model on Hive, mined with Python or other machine learning tools, automated with Scala, Bash, ZooKeeper, and so on.
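As a quick sanity check that the ingestion landed where expected, the target directory can be listed with the same Java client before plugging other frameworks on top; again, the directory and connection details are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsListCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "hadoop")) {
            // List what landed in the (hypothetical) ingestion directory
            for (FileStatus status : fs.listStatus(new Path("/user/hadoop/raw"))) {
                System.out.printf("%s\t%d bytes%n", status.getPath().getName(), status.getLen());
            }
        }
    }
}
```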
In conclusion, Hadoop remains an interesting way of processing, mining, and discovering your data. Even though there are many other technologies, such as Amazon AWS and Microsoft Azure, the techniques used on Apache Hadoop can easily be applied to them, since the concepts are the same. Hence, choosing one or another depends on the situation: the need, the cost-benefit, or many other reasons.
In the end, just enjoy all the technology available and do the best work you can.