Hadoop Cluster

Hadoop Guide (updated 16 Oct 2024): the Hadoop cluster is no longer available. This page is retained for information only.

The HPC cluster does not currently provide an Apache Hadoop cluster. In the past, we maintained an experimental cluster, but it is no longer available.

-------------------- The following is retained as a record of a previous service offering --------------------

The software is managed with the Cloudera Distribution Hadoop (CDH), which makes it easier to maintain HDFS, MapReduce, and HBase. For the version, storage space, and number of nodes, please see the section "Hadoop Cluster View" below.

Important Notes

Very important: make sure to use the latest version of the Hadoop header files, libraries, and jar files, as they may differ from those used in the sample examples. For the version (e.g. CDH-<version>), please see the section "Hadoop Cluster View" below.
Also, the paths (both HOME and HDFS) used in the sample examples may not exactly match yours. Please change them as required.

Accessing Hadoop Cluster

1. If you have never used our Hadoop cluster, your CaseID needs to be added to the Hadoop cluster user list. To get an account, please email us at hpc-supportATcase.edu.
2. Log in to hpcdata and enter your Case password when prompted:
  - ssh -X <CaseID>@hpcdata1.case.edu
3. Create a new directory "hadoop-projects" in the /home/<CaseID> directory and cd into it:
  - mkdir /home/<CaseID>/hadoop-projects
  - cd /home/<CaseID>/hadoop-projects
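
Step 3 can also be wrapped in a small shell function so that it is safe to run more than once. This is only a sketch: the function name make_project_dir is hypothetical and not part of the original guide, and it assumes your home directory is available as $HOME.

```shell
# Hypothetical helper (not part of the original guide).
# Creates "hadoop-projects" under a base directory (default: $HOME)
# and prints the resulting path. mkdir -p does not fail if the
# directory already exists.
make_project_dir() {
    local base="${1:-$HOME}"
    local dir="$base/hadoop-projects"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}
```

After logging in, running cd "$(make_project_dir)" would create the directory if needed and change into it.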

Hadoop Cluster View

View the cluster overview (the YARN ResourceManager web UI on port 8088) with the links command from one of the cluster nodes:

links hpcdata1.priv.cwru.edu:8088

Content Excerpts:

X Apps

Y GB Memory

160 VCores

To quit links, press ESC -> File -> Exit

Some important ports (more at https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ports.html):

8088 - Cluster Overview

19888 - Job History Server

8888 - Hue

11000 - Oozie
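
To avoid retyping host and port, the list above could be wrapped in a small lookup function. This is a sketch under assumptions: the function name hadoop_ui_url and the service labels (overview, jobs, hue, oozie) are made up for illustration, and it assumes that all four services are served over plain http from hpcdata1.priv.cwru.edu, which the guide only confirms for port 8088.

```shell
# Hypothetical helper (not part of the original guide).
# Prints the web UI URL for a service label. Ports come from the list
# above; the single host and the http scheme are assumptions.
hadoop_ui_url() {
    local host="hpcdata1.priv.cwru.edu"
    local port
    case "$1" in
        overview) port=8088  ;;
        jobs)     port=19888 ;;
        hue)      port=8888  ;;
        oozie)    port=11000 ;;
        *) echo "unknown service: $1" >&2; return 1 ;;
    esac
    printf 'http://%s:%s\n' "$host" "$port"
}
```

From a cluster node, links "$(hadoop_ui_url overview)" would then open the same page as the links command shown earlier.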

Packages in CDH

You can also check the CDH version and the native libraries with:

hadoop checknative -a

output:

21/04/15 10:23:28 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native

21/04/15 10:23:28 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

Native library checking:

hadoop: true /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/lib/native/libhadoop.so.1.0.0

zlib: true /lib64/libz.so.1

zstd : true /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/lib/native/libzstd.so.1

snappy: true /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/lib/native/libsnappy.so.1

lz4: true revision:10301

bzip2: true /lib64/libbz2.so.1

openssl: true /lib64/libcrypto.so

ISA-L: true /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/lib/native/libisal.so.2
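
The CDH version appears in the parcel path of several lines above (CDH-6.3.2-... in this output). If only the version number is needed, it could be pulled out with sed. This is a sketch: the function name cdh_version is hypothetical, and it assumes the parcel path has the form .../parcels/CDH-<version>-... as in the output shown.

```shell
# Hypothetical helper (not part of the original guide).
# Reads `hadoop checknative -a` output on stdin and prints the CDH
# version taken from the first line containing a parcel path.
# Prints nothing if no such path is found.
cdh_version() {
    sed -n 's|.*/parcels/CDH-\([0-9][0-9.]*\)-.*|\1|p' | head -n 1
}
```

It would be used as hadoop checknative -a 2>&1 | cdh_version on a node where the hadoop command is installed.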

Hadoop Tutorials:

Hadoop Java: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/hadoop-java

Hadoop Streaming: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/hadoop-streaming

Hadoop Pipes: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/hadoop-pipes

Hadoop HBase: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/hadoop-hbase

Apache Spark: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/apache-spark

Hadoop Pig: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/hadoop-pig

Hadoop Hive: https://sites.google.com/a/case.edu/hpcc/servers-and-storage/hadoop-guide-1/hadoop-hive

References for the Hadoop Examples:

[1] Tutorial sample: http://www.youtube.com/watch?v=1ArXR5cl9fk

[2] Hadoop Tutorial: https://ccp.cloudera.com/display/DOC/Hadoop+Tutorial

[3] Hadoop HBase Tutorial: http://hadoopinterviews.com/data-improt-hbase-map-reduce/

[4] Apache Spark: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

[5] Stanford Workshop: http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

[6] Apache Spark (batch): http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/

[7] Apache Pig: http://hortonworks.com/hadoop/pig/

[8] Cloudera Pig Scripts: http://archive.cloudera.com/cdh4/cdh/4/pig/start.html#pig-scripts

[9] Pig Tutorial: https://github.com/rohitsden/pig-tutorial

[10] Apache Hive: https://hive.apache.org/

[11] Apache Hive Tutorial: https://cwiki.apache.org/confluence/display/Hive/GettingStarted