Hadoop/Spark (outdated)
Please refer to this updated URL for setting up Hadoop/Spark
Resources
HortonWorks training:
HortonWorks Hadoop Essentials 1-day Training (registration required)
Other HortonWorks training materials (Admin, Pig/Hive, Data Science) are available on Blackboard
Amazon AWS Training:
Cloudera Hadoop Distribution Download:
Hadoop Tutorial (Based on Cloudera Hadoop Distribution)
(Run a Hadoop environment with a VM-based distribution)
(Other option: you can follow the HortonWorks training materials, download its VM, and start with the Hadoop 101 tutorial.)
Standalone Virtual Machine with Hadoop Installation
1) Virtual Machine Installation
Download and install a virtual machine client
Candidates are:
VirtualBox (recommended): https://www.virtualbox.org/wiki/Downloads
VMware (trial version): https://www.vmware.com/products/workstation
Install the binaries/executables. Select Yes for all options.
2) Download a Virtual machine image
Options are:
QuickStart VM 5.8 from Cloudera (recommended: VirtualBox image)
Download a distribution matching your virtual machine client.
Unzip the image file.
3) Start your VM client and load the Virtual machine image:
Example: On VirtualBox select File -> Import Appliance
Click the folder icon and select the Virtual machine image extracted from step 2 (There are 2 extracted files; select the .ovf file).
Recommended:
Use a higher number of CPUs than default (1), such as 4, 6, 8+ (depending on how many CPUs are in your machine).
Wait for VirtualBox to load the image.
Start the virtual machine by double-clicking its name in the list.
Once the virtual machine has started, select (on the virtual machine's menu) Devices -> Shared Clipboard -> Bidirectional.
This will enable you to copy and paste text between your native OS and the virtual machine.
Hadoop on Virtual Machine
1) Start Firefox and access the URL: http://quickstart.cloudera/#/
2) To test the Hadoop Distributed File System (HDFS), open a Terminal and execute:
hdfs dfs -ls /
You should see multiple directories listed by HDFS
To list files under user cloudera in HDFS use:
hdfs dfs -ls /user/cloudera
To view the content of a file in HDFS use:
hdfs dfs -cat path_in_hdfs
e.g.:
hdfs dfs -cat /user/cloudera/mytext.txt
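The same checks can also be done from Java code instead of the shell. The following is a minimal illustrative sketch (not part of the class material; the class name HdfsSmokeTest is made up for this example) using the Hadoop FileSystem API. It assumes the Hadoop client JARs are on the classpath and reuses the example file /user/cloudera/mytext.txt from above:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the VM
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -ls /user/cloudera
        for (FileStatus status : fs.listStatus(new Path("/user/cloudera"))) {
            System.out.println(status.getPath());
        }

        // Equivalent of: hdfs dfs -cat /user/cloudera/mytext.txt
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/cloudera/mytext.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}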
3) To test that YARN/MapReduce is available, open a terminal and execute:
yarn node -list
You should see the list of nodes (one node for the VM distribution).
Using Hue UI (Web Graphical Interface):
1) In Firefox, access http://localhost:8888
2) Use cloudera as both username and password.
3) In the top right corner, you can use the File Browser and the Web Browser.
4) View and upload/delete files as you need.
List of commands used in the class tutorial (WordCount example):
(Download the content of a sample text file)
wget http://www.textfiles.com/anarchy/alko170.txt
(List the content of the current directory)
ls -lt
(List the content of some directories on HDFS)
hdfs dfs -ls /user
hdfs dfs -ls /user/cloudera
(Put the text file into HDFS)
hdfs dfs -put alko170.txt /user/cloudera/myalkocopy.txt
(Open a text file and copy the content of the program source into the text editor; a minimal sketch of the source appears after this command list)
gedit WordCount.java
(Paste the content of WordCount.java into the file and then close the editor)
(Create a temporary build directory to store the compiled classes)
mkdir build
(Compile the code into Java bytecode)
javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* WordCount.java -d build -Xlint
(Package the code into a JAR archive file)
jar -cvf WordCount.jar -C build/ .
ls -lt
(You should see that a WordCount.jar file has just been created)
(Execute the WordCount MapReduce job with 2 parameters: input and output path)
hadoop jar WordCount.jar WordCount /user/cloudera/myalkocopy.txt /user/cloudera/wordcountoutput
history
(View the content/result)
hdfs dfs -cat /user/cloudera/wordcountoutput/*
(Remove the output directory)
hdfs dfs -rm -r /user/cloudera/wordcountoutput
history
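For reference, here is a minimal sketch of what the WordCount.java used above typically contains. It is modeled on the standard Hadoop MapReduce word count example, so the version handed out in class may differ in details:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] = input path in HDFS, args[1] = output path (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}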
The following instructions are for Eclipse in Cloudera VM:
Follow a simple Word Count Example:
https://www.youtube.com/watch?v=l3MssCo2eSU
The source code for the main program can be found at:
To add JAR libraries/dependencies:
1) Start Eclipse
2) Select File -> New -> Project
3) Name your project WordCount
4) Click Finish
5) Right-click on your WordCount project in the viewer and select Properties.
6) Select Java Build Path -> Add External JARs
7) Add all JARs available in /usr/lib/hadoop, /usr/lib/hadoop-mapreduce and /usr/lib/hadoop/client
(use Shift-click to select multiple files)
8) Click OK
9) Create a new Java class WordCount.java for your project, and copy the source code from the Word Count Example.
To export a JAR file (for running the job from command line):
1) Right click on your project
2) Select Export -> Java -> JAR file
3) Select path and name for your Jar file, e.g., /home/cloudera/examples/wordcount.jar
To add input files for your program:
1) You will need an input folder in HDFS to run the program. Do not create the output folder in advance: Hadoop creates it itself and refuses to start if it already exists. e.g.:
$ hadoop fs -mkdir -p /user/cloudera/wordcount/input
2) Download an example text file for your WordCount project and copy it into the input folder.
cd /home/cloudera/examples
wget http://textfiles.com/computers/buyguide.txt
hadoop fs -put /home/cloudera/examples/buyguide.txt /user/cloudera/wordcount/input
The last command copies a local file to HDFS. The second-to-last argument is the local file path; the last argument is the target path in HDFS (usually just the directory is sufficient), so you do not have to specify a new name.
To test your program in Eclipse:
1) Select Run -> Run Configurations
2) Under the Arguments tab, type the program arguments:
/home/cloudera/workspace/WordCount/input /home/cloudera/workspace/WordCount/output
Run your program using the run button.
Notice that after every run, you can view your output as part-* files (e.g., part-r-00000) in the output directory. Make sure to delete the output directory so that the next run can execute (Hadoop will not overwrite an existing output directory).
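If you want a rerun to work without deleting the output by hand each time, one common pattern (not part of the class code; shown here as an illustrative sketch, assuming conf is your driver's Configuration object) is to have the driver delete the output directory before submitting the job:

// Add to the imports of your driver class:
import org.apache.hadoop.fs.FileSystem;

// Then, inside main(), before job.waitForCompletion(...):
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = recursive, like hdfs dfs -rm -r
}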
To test your program from the command line:
$ hadoop jar /home/cloudera/examples/wordcount.jar WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
This means that at runtime your program's args[0] is /user/cloudera/wordcount/input and args[1] is /user/cloudera/wordcount/output.
Or follow the instructions below (you have already installed all the required programs).
To run multiple jobs (sequentially) using a single JAR:
See the main() function in MyWordCount2.java, where 2 jobs are created; a sketch of the pattern follows below.
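MyWordCount2.java itself is not reproduced here, but the pattern it uses is roughly the following sketch: create and run the first job, wait for it to complete, then run a second job that reads the first job's output. The path names are illustrative and the mapper/reducer setup is left as placeholder comments, so the real file may differ:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1] + "-tmp"); // output of job 1, input of job 2
        Path output = new Path(args[1]);

        // First job
        Job job1 = Job.getInstance(conf, "job 1");
        job1.setJarByClass(MyWordCount2.class);
        // ... set the mapper/reducer/output key and value classes for job 1 here ...
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        // waitForCompletion blocks, so job 2 starts only after job 1 has finished
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // stop the chain if the first job failed
        }

        // Second job, reading the first job's output
        Job job2 = Job.getInstance(conf, "job 2");
        job2.setJarByClass(MyWordCount2.class);
        // ... set the mapper/reducer/output key and value classes for job 2 here ...
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}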
Hadoop Streaming
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
Download the Python programs at the end of the page.
To run a streaming program, execute WordCount as follows on the command line:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files WordCountMapper.py,WordCountReducer.py -input /user/cloudera/myalkocopy.txt -output /user/cloudera/pythonword -mapper WordCountMapper.py -reducer WordCountReducer.py
This command line achieves the same result as the compiled JAR example above.
Hadoop streaming execution uses keyword arguments (-option value).
Arguments:
-file: a file to package and send to every node. You want to include both your mapper and reducer files; the path should be local. For Amazon EMR, use the Amazon S3 path to them.
-files: the same as -file, but you can pass a comma-separated list of paths.
-input: the path of the input. You can include multiple -input parameters to include multiple input paths.
-output: the path of the output. The last directory level must not already exist; Hadoop will check this before running.
-mapper: the mapper command along with its parameters (you can surround it in double quotes).
-reducer: the reducer command along with its parameters (you can surround it in double quotes).
-numReduceTasks: the number of reduce tasks to be used.
Amazon AWS Terminate Cluster
Follow the following steps to terminate the AWS cluster after you finish working. If you leave the cluster running, you will soon run out of free credits and may get charged on YOUR payment method.