Helpful site: http://backtest-with-r.blogspot.com/
https://groups.google.com/forum/#!forum/rhipe
http://ksssblogs.blogspot.com/?view=flipcard
Hadoop installation on Ubuntu 16.04 LTS
1. https://medium.com/@ujadhav25/installation-of-hadoop-2-7-3-5586a8634a18#.csmwmv7og
2. http://log.malchiodi.com/2015/12/09/installing-hadoop-271-from-scratch-2015-version/
3. http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
The configuration files are located in the etc/hadoop/ directory of the extracted tar.gz archive.
The main Hadoop configuration files are listed below:
1) hadoop-env.sh ->> It specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). The Hadoop framework is written in Java and runs on a JRE, so one of the environment variables set in hadoop-env.sh is $JAVA_HOME.
2) core-site.xml ->> One of the key configuration files, required for the runtime environment settings of a Hadoop cluster. It tells the Hadoop daemons where the NameNode runs in the cluster, and it tells the NameNode which IP address and port to bind to (a minimal example follows this list).
3) hdfs-site.xml ->> It contains the configuration settings for the NameNode, DataNode and Secondary NameNode, including the default block replication factor. The actual number of replicas can also be specified when a file is created.
4) mapred-site.xml ->> It contains the configuration settings for MapReduce. In this file we specify which framework MapReduce should use by setting mapreduce.framework.name.
5) masters ->> It identifies the master nodes in the Hadoop cluster; it tells the Hadoop daemons where the Secondary NameNode runs.
The masters file on a slave node is blank.
6) slaves ->> It identifies the slave nodes in the Hadoop cluster.
The slaves file on the master node contains a list of hosts, one per line.
The slaves file on a slave server contains the IP address of that slave node.
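As a quick reference, here is a minimal single-node sketch of core-site.xml and hdfs-site.xml for Hadoop 2.x. The port (9000), the replication factor and the namenode path are assumptions; the datanode path matches the one used later in these notes. Adjust all of them to your own setup.
core-site.xml:
<configuration>
  <property>
    <!-- where the NameNode listens; clients and DataNodes connect here -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <!-- single-node cluster, so one replica per block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>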
See the installation links above for more details on these configuration files.
1. Hadoop Configuration: ~/.bashrc config
# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
# Java path
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
# For configuring the Hadoop libraries,
export PKG_CONFIG_PATH="/usr/local/lib"
export LD_LIBRARY_PATH="/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server"
export HADOOP="/usr/local/hadoop"
export HADOOP_BIN="/usr/local/hadoop/bin"
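After editing ~/.bashrc, reload it and do a quick sanity check (a minimal sketch; the exact version string depends on your Hadoop release):
source ~/.bashrc
echo $JAVA_HOME      # should print /usr/lib/jvm/java-8-openjdk-amd64
hadoop version       # confirms $HADOOP_HOME/bin is on the PATH
which start-dfs.sh   # confirms $HADOOP_HOME/sbin is on the PATH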
2. Sometimes when you start Hadoop again, the DataNode process is missing. To fix it, use the commands below (note that hadoop namenode -format reformats the NameNode, so any existing data in HDFS is lost):
sudo mv /usr/local/hadoop/hadoop_store/hdfs/datanode /usr/local/hadoop/hadoop_store/hdfs/datanode1
sudo mkdir /usr/local/hadoop/hadoop_store/hdfs/datanode
hadoop namenode -format
start-all.sh
jps
THEN IT WORKS!
Source (found via Google): http://stackoverflow.com/questions/11889261/datanode-process-not-running-in-hadoop
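To confirm the DataNode actually came back and registered with the NameNode (a quick check, assuming a Hadoop 2.x install with the paths above):
hdfs dfsadmin -report   # "Live datanodes (1)" means the DataNode is registered
hdfs dfs -ls /          # a basic HDFS operation should now succeed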
3. RHadoop installation issues:
To install RHadoop you need to install its R packages (see the download link further below), but you may run into trouble with the rJava installation. The error looks like: libjvm.so: cannot open shared object file.
A solution can be found here: http://solaimurugan.blogspot.com/2015/11/rhadoop-integration-isssues.html
First, find libjvm.so:
locate libjvm.so
Once you know where libjvm.so is, create a symbolic link to it in the system library path (adjust the path to whatever locate reported; this setup uses java-8):
sudo ln -s /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so /usr/lib/
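After creating the link, it can also help to re-run R's Java configuration and reinstall rJava (a minimal sketch; it assumes R and a CRAN mirror are already set up):
sudo R CMD javareconf    # refresh R's record of the Java installation
sudo Rscript -e 'install.packages("rJava", repos="https://cloud.r-project.org")'
Rscript -e 'library(rJava); .jinit(); cat(.jcall("java/lang/System", "S", "getProperty", "java.version"), "\n")'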
4. Running the WordCount example in Hadoop (Java/Eclipse)
source: https://portal.futuresystems.org/manual/hadoop-wordcount
1. Download and unzip WordCount under $HADOOP_HOME
or you can download it from the attached file below
Assuming you start SalsaHadoop/Hadoop with $HADOOP_HOME set to ~/hadoop-0.20.203.0 and the master node running on i55, download and unzip the WordCount source code from the Big Data for Science tutorial under $HADOOP_HOME:
[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-WordCount.zip
[taklwu@i55 hadoop-0.20.203.0]$ unzip Hadoop-WordCount.zip
2. Execute: Hadoop-WordCount
First, we need to upload the input files (any plain-text files) into the Hadoop distributed file system (HDFS):
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -put $HADOOP_HOME/Hadoop-WordCount/input/ input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls input
Here, $HADOOP_HOME/Hadoop-WordCount/input/ is the local directory where the program inputs are stored. The second "input" represents the remote destination directory on the HDFS.
After uploading the inputs into HDFS, run the WordCount program with the following commands. We assume you have already compiled the word count program.
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop jar $HADOOP_HOME/Hadoop-WordCount/wordcount.jar WordCount input output
If Hadoop is running correctly, it will print job progress messages similar to the following:
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
11/11/02 18:34:46 INFO input.FileInputFormat: Total input paths to process : 1
11/11/02 18:34:46 INFO mapred.JobClient: Running job: job_201111021738_0001
11/11/02 18:34:47 INFO mapred.JobClient: map 0% reduce 0%
11/11/02 18:35:01 INFO mapred.JobClient: map 100% reduce 0%
11/11/02 18:35:13 INFO mapred.JobClient: map 100% reduce 100%
11/11/02 18:35:18 INFO mapred.JobClient: Job complete: job_201111021738_0001
11/11/02 18:35:18 INFO mapred.JobClient: Counters: 25
...
3. Monitoring Hadoop
We can also monitor the job status with lynx, a text browser, from i136 against the Hadoop monitoring console. Assuming the Hadoop JobTracker is running on i55:9003, type:
[taklwu@i136 ~]$ lynx i55:9003
4. Check the result
After the job finishes, use these commands to check the output:
[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 ~]$ bin/hadoop fs -ls output
[taklwu@i55 ~]$ bin/hadoop fs -cat output/*
Here, "output" is the HDFS directory where the result is stored. The result will look like the following:
you."   15
you;    1
you?    2
you?"   23
young   42
5. Finishing the Map-Reduce process
After the job finishes, use this command to stop the HDFS and MapReduce daemons:
[taklwu@i55 hadoop-0.20.203.0]$ bin/stop-all.sh
6. RHIPE installation
~/.bashrc
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
export PKG_CONFIG_PATH=/usr/local/lib/
export HADOOP=/usr/local/hadoop
export HADOOP_LIB=$HADOOP/lib
export LD_LIBRARY_PATH=/usr/local/lib/
export HADOOP_BIN=/usr/local/hadoop/bin
export HADOOP_LIB=/usr/local/hadoop/etc/hadoop
#export HADOOP_CONF_DIR=/usr/local/hadoop
/etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
HADOOP_HOME=/usr/local/hadoop
HADOOP_BIN=/usr/local/hadoop/bin
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
PKG_CONFIG_PATH=/usr/local/lib
Add the following Sys.setenv() lines to /etc/R/Rprofile.site (for /etc/R/Renviron.site, use the equivalent NAME=value syntax instead, since that file is not R code):
####################################################
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_BIN="/usr/local/hadoop/bin")
Sys.setenv(HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop/")
####################################################
Rhipe download: http://ml.stat.purdue.edu/rhipebin/archive/
REMEMBER:
Check that protobuf is installed and visible to pkg-config:
pkg-config --modversion protobuf
pkg-config --libs protobuf
Keep these Hadoop paths handy; they are what the RHIPE-related environment variables above should point to:
bin: /usr/local/hadoop/bin
lib: /usr/local/hadoop/lib
conf: /usr/local/hadoop/etc/hadoop/
hadoop: /usr/local/hadoop
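With protobuf visible to pkg-config and the paths above in place, the RHIPE install itself is roughly the following (the tarball name is an assumption; use whichever version you grabbed from the archive above):
sudo R CMD INSTALL Rhipe_0.75.1.tar.gz   # assumed filename
Then a quick check from R:
library(Rhipe)
rhinit()   # should initialize RHIPE against the running Hadoop cluster without errors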
Configuration for installing RHadoop (set these inside R, e.g. in Rprofile.site as above):
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.2.jar")
Download packages here:
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
or find in my downloaded attachments
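Once the dependencies and the environment variables above are in place, a hedged sketch of installing the main RHadoop packages from the downloaded archives and smoke-testing them (the tarball names/versions are assumptions; use the files you actually downloaded):
# CRAN dependencies that rmr2/rhdfs need
Rscript -e 'install.packages(c("rJava","Rcpp","RJSONIO","digest","functional","reshape2","stringr","plyr","caTools","bitops"))'
sudo R CMD INSTALL rhdfs_1.0.8.tar.gz    # assumed filename
sudo R CMD INSTALL rmr2_3.3.1.tar.gz     # assumed filename
Quick test from R (assumes HADOOP_CMD and HADOOP_STREAMING are already set as above):
library(rhdfs)
hdfs.init()                      # attaches to HDFS using $HADOOP_CMD
library(rmr2)
out <- mapreduce(input = to.dfs(1:10),
                 map   = function(k, v) keyval(v, v^2))
from.dfs(out)                    # should return 1:10 paired with their squares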
Install rhbase
1. Build and install Apache Thrift. We recommend that you install on the node containing the HBase Master. See http://thrift.apache.org/ for more details on building and installing Thrift.
2. Install the dependencies for Thrift. At the prompt, type:
sudo apt-get install libboost-all-dev
Important! If installing as non-root, you will need a system administrator to help install these dependencies.
3. Download Thrift 0.8.0: http://archive.apache.org/dist/thrift/0.8.0/thrift-0.8.0.tar.gz
4. Unpack the Thrift archive. At the prompt, type:
tar -xzf thrift-0.8.0.tar.gz
5. Change directory into the versioned Thrift directory. At the prompt, type:
cd thrift-0.8.0
6. Build the Thrift library. We only need the C++ interface of Thrift, so we build without Ruby or Python. At the prompt, type the following two commands:
./configure --without-ruby --without-python
make
7. Install the Thrift library. At the prompt, type:
make install
8. Create a symbolic link to the Thrift library so it can be loaded by the rhbase package. Example of a symbolic link (the /usr/lib64 target comes from the RHEL guide; on Ubuntu use /usr/lib instead):
ln -s /usr/local/lib/libthrift-0.8.0.so /usr/lib64
9. Set up the PKG_CONFIG_PATH environment variable. At the prompt, type:
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
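Before building rhbase, it is worth confirming that pkg-config can now see Thrift (assuming Thrift 0.8.0 put its thrift.pc file under /usr/local/lib/pkgconfig):
pkg-config --modversion thrift     # should print 0.8.0
pkg-config --cflags --libs thrift  # the flags rhbase's build will pick up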
10. Download the rhbase package: https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
11. Install rhbase, only on the node that will run the R client.
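The install itself looks roughly like this (the tarball name is an assumption; use the version you downloaded, and the HBase Thrift server must be running before the test will connect):
sudo R CMD INSTALL rhbase_1.2.1.tar.gz   # assumed filename from the downloads page
Then, from R:
library(rhbase)
hb.init()            # connects to the HBase Thrift server (default localhost:9090)
hb.list.tables()     # lists the HBase tables if the connection works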
source: https://github.com/RevolutionAnalytics/RHadoop/wiki/Installing-RHadoop-on-RHEL
Important: Ensure that libthrift-0.8.0 is in the system library path, which is distinct and separate from the Thrift installation path.
If your "locally installed libraries" are installed in, for example, /usr/local/lib, add this directory to /etc/ld.so.conf (it's a text file) and run "ldconfig"
The command will run a caching utility, but will also create all the necessary "symbolic links" required for the loader system to function. It is surprising that the "make install" for libcurl did not do this already, but it's possible it could not if /usr/local/lib is not in /etc/ld.so.conf already.
PS: it's possible that your /etc/ld.so.conf contains nothing but "include ld.so.conf.d/*.conf". You can still add a directory path after it, or just create a new file inside the directory it's being included from. Dont forget to run "ldconfig" after it.
Be careful. Getting this wrong can screw up your system.
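A concrete sketch of the "new file in ld.so.conf.d" approach (the .conf filename here is arbitrary):
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/usr-local-lib.conf
sudo ldconfig
ldconfig -p | grep thrift   # libthrift-0.8.0.so should now appear in the loader cache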
For further interest, please contact me by email.