Hadoop/Spark (outdated)
Please refer to this updated URL for setting up Hadoop/Spark
Resources
HortonWorks training:
HortonWorks Hadoop Essentials 1-day Training (registration required)
Other HortonWorks training materials (Admin, Pig/Hive, Data Science) are available on Blackboard
Amazon AWS Training:
Cloudera Hadoop Distribution Download:
Hadoop Tutorial (Based on Cloudera Hadoop Distribution)
(Run a Hadoop environment with a VM-based distribution)
(Other option: you can follow the HortonWorks training materials, download its VM, and start with the Hadoop 101 tutorial.)
Standalone Virtual Machine with Hadoop Installation
1) Virtual Machine Installation
Download and install a virtual machine client
Candidates are:
VirtualBox (recommended): https://www.virtualbox.org/wiki/Downloads
VMware (trial version): https://www.vmware.com/products/workstation
Install the binaries/executables. Select Yes for all options.
2) Download a Virtual machine image
Options are:
QuickStart VM 5.8 from Cloudera (recommended: VirtualBox image)
Download a distribution matching your virtual machine client.
Unzip the image file.
3) Start your VM client and load the Virtual machine image:
Example: On VirtualBox select File -> Import Appliance
Click the folder icon and select the Virtual machine image extracted from step 2 (There are 2 extracted files; select the .ovf file).
Recommended:
Use a higher number of CPUs than default (1), such as 4, 6, 8+ (depending on how many CPUs are in your machine).
Wait for VirtualBox to load the image.
Start the virtual machine by double-clicking its name in the list.
Once the virtual machine has started, select (on the virtual machine's menu) Devices -> Shared Clipboard -> Bidirectional.
This will enable you to copy and paste text between your native OS and the virtual machine.
Hadoop on Virtual Machine
1) Start Firefox and access the URL: http://quickstart.cloudera/#/
2) To test the Hadoop Distributed File System (HDFS), open a Terminal and execute:
hdfs dfs -ls /
You should see multiple directories listed by HDFS
To list files under user cloudera in HDFS use:
hdfs dfs -ls /user/cloudera
To view the content of a file in HDFS use:
hdfs dfs -cat path_in_hdfs
e.g.:
hdfs dfs -cat /user/cloudera/mytext.txt
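The same checks can also be done from Java code instead of the shell. The following is a minimal illustrative sketch (not part of the class material; the class name HdfsSmokeTest is made up for this example) using the Hadoop FileSystem API. It assumes the Hadoop client JARs are on the classpath and reuses the example file /user/cloudera/mytext.txt from above:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the VM
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -ls /user/cloudera
        for (FileStatus status : fs.listStatus(new Path("/user/cloudera"))) {
            System.out.println(status.getPath());
        }

        // Equivalent of: hdfs dfs -cat /user/cloudera/mytext.txt
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/cloudera/mytext.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}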
3) To test that YARN/MapReduce is available, open a terminal and execute:
yarn node -list
You should see the list of nodes (one node for the VM distribution).
Using Hue UI (Web Graphical Interface):
1) In Firefox, access http://localhost:8888
2) Use cloudera as both username and password.
3) In the top right corner, you can use the File Browser and the Web Browser.
4) View and upload/delete files as you need.
List of commands used in the class tutorial (WordCount example):
(Download the content of a sample text file)
wget http://www.textfiles.com/anarchy/alko170.txt
(List the content of the current directory)
ls -lt
(List the content of some directories on HDFS)
hdfs dfs -ls /user
hdfs dfs -ls /user/cloudera
(Put the text file into HDFS)
hdfs dfs -put alko170.txt /user/cloudera/myalkocopy.txt
(Open a text file and copy the content of the program source into the text editor; a minimal sketch of the source appears after this command list)
gedit WordCount.java
(Paste the content of WordCount.java into the file and then close the editor)
(Create a temporary build directory to store the compiled classes)
mkdir build
(Compile the code into Java bytecode)
javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* WordCount.java -d build -Xlint
(Package the code into a JAR archive file)
jar -cvf WordCount.jar -C build/ .
ls -lt
(You should see that a WordCount.jar file has just been created)
(Execute the WordCount MapReduce job with 2 parameters: input and output path)
hadoop jar WordCount.jar WordCount /user/cloudera/myalkocopy.txt /user/cloudera/wordcountoutput
history
(View the content/result)
hdfs dfs -cat /user/cloudera/wordcountoutput/*
(Remove the output directory)
hdfs dfs -rm -r /user/cloudera/wordcountoutput
history
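For reference, here is a minimal sketch of what the WordCount.java used above typically contains. It is modeled on the standard Hadoop MapReduce word count example, so the version handed out in class may differ in details:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] = input path in HDFS, args[1] = output path (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}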
The following instructions are for Eclipse in Cloudera VM:
Follow a simple Word Count Example:
https://www.youtube.com/watch?v=l3MssCo2eSU
The source code for the main program can be found at:
To add JAR libraries/dependencies:
1) Start Eclipse
2) Select File -> New -> Project
3) Name your project WordCount
4) Click Finish
5) Right-click on your WordCount project in the viewer and select Properties.
6) Select Java Build Path -> Add External JARs
7) Add all JARs available in /usr/lib/hadoop, /usr/lib/hadoop-mapreduce and /usr/lib/hadoop/client
(use Shift-click to select multiple files)
8) Click OK
9) Create a new Java class WordCount.java for your project, and copy the source code from the Word Count Example.
To export a JAR file (for running the job from command line):
1) Right click on your project
2) Select Export -> Java -> JAR file
3) Select path and name for your Jar file, e.g., /home/cloudera/examples/wordcount.jar
To add input files for your program:
1) You will need an input folder in HDFS to run the program. Do not create the output folder in advance: Hadoop creates it itself and refuses to start if it already exists. e.g.:
$ hadoop fs -mkdir -p /user/cloudera/wordcount/input
2) Download an example text file for your WordCount project and copy it into the input folder.
cd /home/cloudera/examples
wget http://textfiles.com/computers/buyguide.txt
hadoop fs -put /home/cloudera/examples/buyguide.txt /user/cloudera/wordcount/input
The last command copies a local file to HDFS. The second-to-last argument is the local file path; the last argument is the target path in HDFS (usually just the directory is sufficient), so you do not have to specify a new name.
To test your program in Eclipse:
1) Select Run -> Run Configurations
2) Under the Arguments tab, type the program arguments:
/home/cloudera/workspace/WordCount/input /home/cloudera/workspace/WordCount/output
Run your program using the run button.
Notice that after every run, you can view your output as part-* files (e.g., part-r-00000) in the output directory. Make sure to delete the output directory so that the next run can execute (Hadoop will not overwrite an existing output directory).
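If you want a rerun to work without deleting the output by hand each time, one common pattern (not part of the class code; shown here as an illustrative sketch, assuming conf is your driver's Configuration object) is to have the driver delete the output directory before submitting the job:

// Add to the imports of your driver class:
import org.apache.hadoop.fs.FileSystem;

// Then, inside main(), before job.waitForCompletion(...):
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = recursive, like hdfs dfs -rm -r
}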
To test your program from the command line:
$ hadoop jar /home/cloudera/examples/wordcount.jar WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
This means that at runtime your program's args[0] is /user/cloudera/wordcount/input and args[1] is /user/cloudera/wordcount/output.
Or follow the instructions below (you have already installed all the required programs).
To run multiple jobs (sequentially) using a single JAR:
See the main() function in MyWordCount2.java, where 2 jobs are created; a sketch of the pattern follows below.
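MyWordCount2.java itself is not reproduced here, but the pattern it uses is roughly the following sketch: create and run the first job, wait for it to complete, then run a second job that reads the first job's output. The path names are illustrative and the mapper/reducer setup is left as placeholder comments, so the real file may differ:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1] + "-tmp"); // output of job 1, input of job 2
        Path output = new Path(args[1]);

        // First job
        Job job1 = Job.getInstance(conf, "job 1");
        job1.setJarByClass(MyWordCount2.class);
        // ... set the mapper/reducer/output key and value classes for job 1 here ...
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        // waitForCompletion blocks, so job 2 starts only after job 1 has finished
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // stop the chain if the first job failed
        }

        // Second job, reading the first job's output
        Job job2 = Job.getInstance(conf, "job 2");
        job2.setJarByClass(MyWordCount2.class);
        // ... set the mapper/reducer/output key and value classes for job 2 here ...
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}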
Hadoop Streaming
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
Download the Python programs at the end of the page.
To run a streaming program, execute WordCount as follows on the command line:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files WordCountMapper.py,WordCountReducer.py -input /user/cloudera/myalkocopy.txt -output /user/cloudera/pythonword -mapper WordCountMapper.py -reducer WordCountReducer.py
This command line achieves the same result as the compiled JAR example above.
Hadoop streaming execution uses keyword arguments (-option value).
Arguments:
-file: a file to package and send to every node. You want to include both your mapper and reducer files; the path should be local. For Amazon EMR, use the Amazon S3 path to them.
-files: the same as -file, but you can pass a comma-separated list of paths.
-input: the path of the input. You can include multiple -input parameters to include multiple input paths.
-output: the path of the output. The last directory level must not already exist; Hadoop will check this before running.
-mapper: the mapper command along with its parameters (you can surround it in double quotes).
-reducer: the reducer command along with its parameters (you can surround it in double quotes).
-numReduceTasks: the number of reduce tasks to be used.
Amazon AWS Terminate Cluster
Follow the following steps to terminate the AWS cluster after you finish working. If you leave the cluster running, you will soon run out of free credits and may get charged on YOUR payment method.