Hadoop is an open-source software framework for storing and processing big data in a distributed, parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and fast, parallel processing. The core of Hadoop consists of HDFS, the Hadoop Distributed File System, and the Hadoop implementation of MapReduce.
All active HPC users have an account on Peel. If you need an account, review the HPC Getting and Renewing Accounts page for instructions on how to get an account.
The Peel login nodes can be reached directly using the following command (NYU VPN required):
ssh <NetID>@peel.hpc.nyu.edu
For more details about logging into the Peel cluster, read the Accessing HPC Systems page.
HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
To upload data to HDFS, use one of the following commands. The source can be any local path, not just /scratch.
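For example (a sketch with placeholder paths; hadoop fs and hdfs dfs are equivalent front ends to the same HDFS client):
hadoop fs -put <local_path> <hdfs_path>
hdfs dfs -put <local_path> <hdfs_path>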
To get data from HDFS, use one of the following commands:
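For example (again with placeholder paths):
hadoop fs -get <hdfs_path> <local_path>
hdfs dfs -get <hdfs_path> <local_path>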
To list files in HDFS, use the following command:
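For example, to list your HDFS home directory (the path here is a placeholder):
hadoop fs -ls /user/<net_id>/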
What Is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing. The mapping and reducing functions receive not just values, but (key, value) pairs.
Every MapReduce job consists of at least two parts:
The Mapper
The Reducer
Mapping Phase: Takes input as <key,value> pairs, processes them, and produces another set of intermediate <key,value> pairs as output.
Reducing Phase: Aggregates values by key. For each key, the reducer function receives an iterator over the values emitted for that key and combines them into a single output value.
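To make the model concrete, here is a minimal sketch of a streaming word-count mapper and reducer in Python (the mapper.py and reducer.py shipped with the example below may differ in detail). The mapper emits a (word, 1) pair for every word; the reducer sums the counts for each word, relying on Hadoop streaming sorting the mapper output by key between the two phases.

#!/usr/bin/env python
# mapper.py (sketch): read lines from stdin, emit "word<TAB>1" for each word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py (sketch): input arrives sorted by key, so counts for the
# same word are adjacent and can be summed in a single pass
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))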
MapReduce Word Count Example
The objective here is to count the number of occurrences of each word using key-value pairs in Python.
Get the Mapper and Reducer program files
cp -r /share/apps/examples/hadoop-streaming $HOME/example/
cd $HOME/example/hadoop-streaming
Get the input file
cp -r /scratch/work/public/peel/tutorials/Tutorial1/example1 .
Place the book.txt file onto HDFS
hadoop fs -put /home/<net_id>/example1/book.txt /user/<net_id>/book.txt
An example of how to run a Hadoop-streaming job is:
export HADOOP_LIBPATH=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p0.6626826/lib
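The command below is a sketch of such a streaming run, assuming the mapper.py and reducer.py from the example directory copied above; the exact location of hadoop-streaming.jar under $HADOOP_LIBPATH may differ on Peel:
hadoop jar $HADOOP_LIBPATH/hadoop-mapreduce/hadoop-streaming.jar \
    -files $HOME/example/hadoop-streaming/mapper.py,$HOME/example/hadoop-streaming/reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/<net_id>/book.txt \
    -output /user/<net_id>/example.out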
Check output by accessing the HDFS directories:
hadoop fs -ls /user/<net_id>/example.out
hadoop fs -cat /user/<net_id>/example.out/part-r-00000
Alternatively, you can use the following:
hadoop fs -getmerge /user/<net_id>/example.out $HOME/output.txt
cat $HOME/output.txt
For more information, please visit the Hadoop User Guide page.
What is Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. (source: Wikipedia)
Launching an Interactive Spark Shell
Spark provides an interactive shell that offers a simple way to learn the API and to analyze data sets interactively. Run the following command at the Peel command prompt to connect to the Spark shell.
spark-shell --deploy-mode client
Spark Wordcount Example
The goal of this example is to run a word count in the interactive spark-shell using an input file from HDFS.
Step 1: Copy input file to HDFS
hadoop fs -put /scratch/work/public/peel/tutorials/Tutorial3/input/animals.txt /user/<net_id>/
Step 2: Log in to the interactive spark-shell using the command "spark-shell" and run the following commands.
-bash-4.1$ spark-shell --deploy-mode client
val file = sc.textFile("/user/<net_id>/<input file>");
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _);
counts.collect().foreach(println);
YARN is the resource manager and job scheduler in the Peel cluster. YARN allows you to use various data processing engines for batch, interactive, and real-time stream processing of data stored in HDFS.
Job Queues
The total memory available to users' YARN containers is 4.45 TB. A queue named 'q1' exists to accommodate users with large memory requirements. The majority of users are in the 'default' queue, which has guaranteed resources. If you want to be placed in 'q1', please contact us. To check which queue you are using, run a Spark application, then go to the All Yarn Applications page and look at the 'Queue' column for your application.
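The queue assignment is also shown in the 'Queue' column of the command-line application listing, for example:
yarn application -list -appStates RUNNING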
Application status and logs
List the currently running applications using the 'yarn' script. Running the yarn script without any arguments prints the description for all commands.
$ yarn application -list
To kill a currently running application (for example, because it is malfunctioning or, in the worst case, stuck in an infinite loop), get the application ID and then kill it as shown below:
$ yarn application -kill <application_ID>
To download an application's logs for examination on the command line:
$ yarn logs -applicationId <application_ID>
Apache Zeppelin is a web-based interactive computational environment that can use Apache Spark as a backend. It is similar in spirit to the IPython Notebook. Zeppelin is installed on Peel.
Peel is a YARN cluster, not a standalone Spark cluster. Here is the process to work with Zeppelin on Peel.
# Create personal directories and copy the configuration over
$ mkdir -p $HOME/zeppelin/{conf,logs,notebook,run,webapps}
$ cp /share/apps/peel/zeppelin/0.9.0/conf/* $HOME/zeppelin/conf/
See the [users] section in $HOME/zeppelin/conf/shiro.ini for the default user/password and instructions on changing the password. The procedures to start and stop a Zeppelin server on the Peel login nodes are:
# Start a Zeppelin daemon
$ /share/apps/peel/zeppelin/0.9.0/bin/zeppelin-daemon.sh --config $HOME/zeppelin/conf start
Zeppelin start at port 9178 [ OK ]
Please remember to clean up when you are done, so you do not leave stale processes hanging around:
# Stop the daemon
$ /share/apps/peel/zeppelin/0.9.0/bin/zeppelin-daemon.sh --config $HOME/zeppelin/conf stop
Zeppelin stop [ OK ]
There are two Peel login nodes. Verify which login node the Zeppelin server was started on by running the 'hostname' command. In this example, 'hlog-1.hpc.nyu.edu' was used and the NYU VPN is established.
Open a new terminal on your computer, and enable SSH port forwarding by running the following command:
ssh -L 4321:localhost:9178 <net_id>@hlog-1.hpc.nyu.edu
For Windows machines, use PuTTY with the following configuration:
4321:localhost:9178 <net_id>@hlog-1.hpc.nyu.edu
Note that the Zeppelin port is randomized; it could be different the next time you start a new server. The port number 9178 was generated when the Zeppelin daemon was started, as shown above.
Now open a new web browser tab and enter the following address to reach the Zeppelin UI:
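With the local forwarding port 4321 chosen above, the address is:
http://localhost:4321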
Please read