Hadoop is an open-source software framework for storing and processing big data in a distributed, parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and fast, parallel processing. The core of Hadoop consists of HDFS, the Hadoop Distributed File System, and the Hadoop implementation of MapReduce.
All active HPC users have an account on Peel. If you need an account, review the HPC Getting and Renewing Accounts page for instructions on how to get an account.
The Peel login nodes can be reached directly using the following command (NYU VPN required):
ssh <NetID>@peel.hpc.nyu.edu
For more details about logging into the Peel cluster, read the Accessing HPC Systems page.
HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
To upload data to HDFS, use one of the following commands. The source can be any local path, not just /scratch.
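For example (a sketch with placeholder paths; hadoop fs and hdfs dfs are equivalent front ends to the same HDFS client):
hadoop fs -put <local_path> <hdfs_path>
hdfs dfs -put <local_path> <hdfs_path>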
To get data from HDFS, use one of the following commands:
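For example (again with placeholder paths):
hadoop fs -get <hdfs_path> <local_path>
hdfs dfs -get <hdfs_path> <local_path>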
To list files in HDFS, use the following command:
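For example, to list your HDFS home directory (the path here is a placeholder):
hadoop fs -ls /user/<net_id>/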
What Is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing. The mapping and reducing functions receive not just values, but (key, value) pairs.
Every MapReduce job consists of at least two parts:
The Mapper
The Reducer
Mapping Phase: Takes input as <key,value> pairs, processes them, and produces another set of intermediate <key,value> pairs as output.
Reducing Phase: Aggregates values by key. For each key, the reducer function receives an iterator over the values emitted for that key and combines them into a single output value.
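To make the model concrete, here is a minimal sketch of a streaming word-count mapper and reducer in Python (the mapper.py and reducer.py shipped with the example below may differ in detail). The mapper emits a (word, 1) pair for every word; the reducer sums the counts for each word, relying on Hadoop streaming sorting the mapper output by key between the two phases.

#!/usr/bin/env python
# mapper.py (sketch): read lines from stdin, emit "word<TAB>1" for each word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py (sketch): input arrives sorted by key, so counts for the
# same word are adjacent and can be summed in a single pass
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))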
MapReduce Word Count Example
The objective here is to count the number of occurrences of each word using key-value pairs in Python.
Get the Mapper and Reducer program files
cp -r /share/apps/examples/hadoop-streaming $HOME/example/
cd $HOME/example/hadoop-streaming
Get the input file
cp -r /scratch/work/public/peel/tutorials/Tutorial1/example1 .
Place the book.txt file onto HDFS
hadoop fs -put /home/<net_id>/example1/book.txt /user/<net_id>/book.txt
An example of how to run a Hadoop-streaming job is:
export HADOOP_LIBPATH=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p0.6626826/lib
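The command below is a sketch of such a streaming run, assuming the mapper.py and reducer.py from the example directory copied above; the exact location of hadoop-streaming.jar under $HADOOP_LIBPATH may differ on Peel:
hadoop jar $HADOOP_LIBPATH/hadoop-mapreduce/hadoop-streaming.jar \
    -files $HOME/example/hadoop-streaming/mapper.py,$HOME/example/hadoop-streaming/reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/<net_id>/book.txt \
    -output /user/<net_id>/example.out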
Check output by accessing the HDFS directories:
hadoop fs -ls /user/<net_id>/example.out
hadoop fs -cat /user/<net_id>/example.out/part-r-00000
Alternatively, you can use the following:
hadoop fs -getmerge /user/<net_id>/example.out $HOME/output.txt
cat $HOME/output.txt
For more information, please visit the Hadoop User Guide page.
What is Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. (source: Wikipedia)
Launching an Interactive Spark Shell
Spark provides an interactive shell that offers a simple way to learn the API and to analyze data sets interactively. Run the following command at the Peel command prompt to connect to the Spark shell.
spark-shell --deploy-mode client
Spark Wordcount Example
The goal of this example is to run a word count in the interactive spark-shell using an input file from HDFS.
Step 1: Copy input file to HDFS
hadoop fs -put /scratch/work/public/peel/tutorials/Tutorial3/input/animals.txt /user/<net_id>/
Step 2: Log in to the interactive spark-shell using the command "spark-shell" and run the following commands.
-bash-4.1$ spark-shell --deploy-mode client
val file = sc.textFile("/user/<net_id>/<input file>");
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _);
counts.collect().foreach(println);
YARN is the resource manager and job scheduler in the Peel cluster. YARN allows you to use various data processing engines for batch, interactive, and real-time stream processing of data stored in HDFS.
Job Queues
The total memory available to users' YARN containers is 4.45 TB. A queue named 'q1' exists to accommodate users with large memory requirements. The majority of users are in the 'default' queue, which has guaranteed resources. If you want to be placed in 'q1', please contact us. To check which queue you are using, run a Spark application, then go to the All Yarn Applications page and look at the 'Queue' column for your application.
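The queue assignment is also shown in the 'Queue' column of the command-line application listing, for example:
yarn application -list -appStates RUNNING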
Application status and logs
List the currently running applications using the 'yarn' script. Running the yarn script without any arguments prints the description for all commands.
$ yarn application -list
To kill a currently running application (for example, because it is malfunctioning or, in the worst case, stuck in an infinite loop), get the application ID and then kill it as shown below:
$ yarn application -kill <application_ID>
To download an application's logs for examination on the command line:
$ yarn logs -applicationId <application_ID>
Apache Zeppelin is a web-based interactive computational environment that can use Apache Spark as a backend. It is similar in spirit to the IPython Notebook. Zeppelin is installed on Peel.
Peel is a YARN cluster, not a standalone Spark cluster. Here is the process to work with Zeppelin on Peel.
# Create personal directories and copy the configuration over
$ mkdir -p $HOME/zeppelin/{conf,logs,notebook,run,webapps}
$ cp /share/apps/peel/zeppelin/0.9.0/conf/* $HOME/zeppelin/conf/
See the [users] section in $HOME/zeppelin/conf/shiro.ini for the default user/password and instructions on changing the password. The procedures to start and stop a Zeppelin server on the Peel login nodes are:
# Start a Zeppelin daemon
$ /share/apps/peel/zeppelin/0.9.0/bin/zeppelin-daemon.sh --config $HOME/zeppelin/conf start
Zeppelin start at port 9178 [ OK ]
Please remember to clean up when you are done, so you do not leave stale processes hanging around:
# Stop the daemon
$ /share/apps/peel/zeppelin/0.9.0/bin/zeppelin-daemon.sh --config $HOME/zeppelin/conf stop
Zeppelin stop [ OK ]
There are two Peel login nodes. Verify which login node the Zeppelin server was started on by running the 'hostname' command. In this example, 'hlog-1.hpc.nyu.edu' was used and the NYU VPN is established.
Open a new terminal on your computer, and enable SSH port forwarding by running the following command:
ssh -L 4321:localhost:9178 <net_id>@hlog-1.hpc.nyu.edu
For Windows machines, use PuTTY with the following configuration:
4321:localhost:9178 <net_id>@hlog-1.hpc.nyu.edu
Note that the Zeppelin port is randomized; it could be different the next time you start a new server. The port number 9178 was generated when the Zeppelin daemon was started, as shown above.
Now open a new web browser tab and enter the following address to reach the Zeppelin UI:
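With the local forwarding port 4321 chosen above, the address is:
http://localhost:4321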
Please read