Helpful site: http://backtest-with-r.blogspot.com/
https://groups.google.com/forum/#!forum/rhipe
http://ksssblogs.blogspot.com/?view=flipcard
Hadoop installation on Ubuntu 16.04 LTS
1. https://medium.com/@ujadhav25/installation-of-hadoop-2-7-3-5586a8634a18#.csmwmv7og
2. http://log.malchiodi.com/2015/12/09/installing-hadoop-271-from-scratch-2015-version/
3. http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
The configuration files are located in the etc/hadoop/ directory of the extracted tar.gz archive.
The main Hadoop configuration files are listed below:
1) hadoop-env.sh ->> It specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). The Hadoop framework is written in Java and runs on a JRE, so one of the environment variables set in hadoop-env.sh is $JAVA_HOME.
2) core-site.xml ->> One of the key configuration files, required for the runtime environment settings of a Hadoop cluster. It tells the Hadoop daemons where the NameNode runs in the cluster, and it tells the NameNode which IP address and port to bind to (a minimal example follows this list).
3) hdfs-site.xml ->> It contains the configuration settings for the NameNode, DataNode and Secondary NameNode, including the default block replication factor. The actual number of replicas can also be specified when a file is created.
4) mapred-site.xml ->> It contains the configuration settings for MapReduce. In this file we specify which framework MapReduce should use by setting mapreduce.framework.name.
5) masters ->> It identifies the master nodes in the Hadoop cluster; it tells the Hadoop daemons where the Secondary NameNode runs.
The masters file on a slave node is blank.
6) slaves ->> It identifies the slave nodes in the Hadoop cluster.
The slaves file on the master node contains a list of hosts, one per line.
The slaves file on a slave server contains the IP address of that slave node.
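As a quick reference, here is a minimal single-node sketch of core-site.xml and hdfs-site.xml for Hadoop 2.x. The port (9000), the replication factor and the namenode path are assumptions; the datanode path matches the one used later in these notes. Adjust all of them to your own setup.
core-site.xml:
<configuration>
  <property>
    <!-- where the NameNode listens; clients and DataNodes connect here -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <!-- single-node cluster, so one replica per block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>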
See the installation links above for more details on these configuration files.
1. Hadoop Configuration: ~/.bashrc config
# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
# Java path
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
# For configuring the Hadoop libraries,
export PKG_CONFIG_PATH="/usr/local/lib"
export LD_LIBRARY_PATH="/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server"
export HADOOP="/usr/local/hadoop"
export HADOOP_BIN="/usr/local/hadoop/bin"
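After editing ~/.bashrc, reload it and do a quick sanity check (a minimal sketch; the exact version string depends on your Hadoop release):
source ~/.bashrc
echo $JAVA_HOME      # should print /usr/lib/jvm/java-8-openjdk-amd64
hadoop version       # confirms $HADOOP_HOME/bin is on the PATH
which start-dfs.sh   # confirms $HADOOP_HOME/sbin is on the PATH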
2. Sometimes when you start Hadoop again, the DataNode process is missing. To fix it, use the commands below (note that hadoop namenode -format reformats the NameNode, so any existing data in HDFS is lost):
sudo mv /usr/local/hadoop/hadoop_store/hdfs/datanode /usr/local/hadoop/hadoop_store/hdfs/datanode1
sudo mkdir /usr/local/hadoop/hadoop_store/hdfs/datanode
hadoop namenode -format
start-all.sh
jps
THEN IT WORKS!
Source (found via Google): http://stackoverflow.com/questions/11889261/datanode-process-not-running-in-hadoop
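To confirm the DataNode actually came back and registered with the NameNode (a quick check, assuming a Hadoop 2.x install with the paths above):
hdfs dfsadmin -report   # "Live datanodes (1)" means the DataNode is registered
hdfs dfs -ls /          # a basic HDFS operation should now succeed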
3. RHadoop installation issues:
To install RHadoop you need to install its R packages (see the download link further below), but you may run into trouble with the rJava installation. The error looks like: libjvm.so: cannot open shared object file.
A solution can be found here: http://solaimurugan.blogspot.com/2015/11/rhadoop-integration-isssues.html
First, find libjvm.so:
locate libjvm.so
Once you know where libjvm.so is, create a symbolic link to it in the system library path (adjust the path to whatever locate reported; this setup uses java-8):
sudo ln -s /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so /usr/lib/
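After creating the link, it can also help to re-run R's Java configuration and reinstall rJava (a minimal sketch; it assumes R and a CRAN mirror are already set up):
sudo R CMD javareconf    # refresh R's record of the Java installation
sudo Rscript -e 'install.packages("rJava", repos="https://cloud.r-project.org")'
Rscript -e 'library(rJava); .jinit(); cat(.jcall("java/lang/System", "S", "getProperty", "java.version"), "\n")'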
4. Running the WordCount example in Hadoop (Java/Eclipse)
source: https://portal.futuresystems.org/manual/hadoop-wordcount
1. Download and unzip WordCount under $HADOOP_HOME
or you can download it from the attached file below
Assuming you start SalsaHadoop/Hadoop with $HADOOP_HOME set to ~/hadoop-0.20.203.0 and the master node running on i55, download and unzip the WordCount source code from the Big Data for Science tutorial under $HADOOP_HOME:
[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-WordCount.zip
[taklwu@i55 hadoop-0.20.203.0]$ unzip Hadoop-WordCount.zip
2. Execute: Hadoop-WordCount
First, we need to upload the input files (any plain-text files) into the Hadoop distributed file system (HDFS):
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -put $HADOOP_HOME/Hadoop-WordCount/input/ input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls input
Here, $HADOOP_HOME/Hadoop-WordCount/input/ is the local directory where the program inputs are stored. The second "input" represents the remote destination directory on the HDFS.
After uploading the inputs into HDFS, run the WordCount program with the following commands. We assume you have already compiled the word count program.
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop jar $HADOOP_HOME/Hadoop-WordCount/wordcount.jar WordCount input output
If Hadoop is running correctly, it will print job progress messages similar to the following:
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
11/11/02 18:34:46 INFO input.FileInputFormat: Total input paths to process : 1
11/11/02 18:34:46 INFO mapred.JobClient: Running job: job_201111021738_0001
11/11/02 18:34:47 INFO mapred.JobClient: map 0% reduce 0%
11/11/02 18:35:01 INFO mapred.JobClient: map 100% reduce 0%
11/11/02 18:35:13 INFO mapred.JobClient: map 100% reduce 100%
11/11/02 18:35:18 INFO mapred.JobClient: Job complete: job_201111021738_0001
11/11/02 18:35:18 INFO mapred.JobClient: Counters: 25
...
3. Monitoring Hadoop
We can also monitor the job status with lynx, a text browser, from i136 against the Hadoop monitoring console. Assuming the Hadoop JobTracker is running on i55:9003, type:
[taklwu@i136 ~]$ lynx i55:9003
4. Check the result
After the job finishes, use these commands to check the output:
[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 ~]$ bin/hadoop fs -ls output
[taklwu@i55 ~]$ bin/hadoop fs -cat output/*
Here, "output" is the HDFS directory where the result is stored. The result will look like the following:
you."   15
you;    1
you?    2
you?"   23
young   42
5. Finishing the Map-Reduce process
After the job finishes, use this command to stop the HDFS and MapReduce daemons:
[taklwu@i55 hadoop-0.20.203.0]$ bin/stop-all.sh
6. RHIPE installation
~/.bashrc
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
export PKG_CONFIG_PATH=/usr/local/lib/
export HADOOP=/usr/local/hadoop
export HADOOP_LIB=$HADOOP/lib
export LD_LIBRARY_PATH=/usr/local/lib/
export HADOOP_BIN=/usr/local/hadoop/bin
export HADOOP_LIB=/usr/local/hadoop/etc/hadoop
#export HADOOP_CONF_DIR=/usr/local/hadoop
/etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
HADOOP_HOME=/usr/local/hadoop
HADOOP_BIN=/usr/local/hadoop/bin
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
PKG_CONFIG_PATH=/usr/local/lib
Add the following Sys.setenv() lines to /etc/R/Rprofile.site (for /etc/R/Renviron.site, use the equivalent NAME=value syntax instead, since that file is not R code):
####################################################
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_BIN="/usr/local/hadoop/bin")
Sys.setenv(HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop/")
####################################################
Rhipe download: http://ml.stat.purdue.edu/rhipebin/archive/
REMEMBER:
Check that protobuf is installed and visible to pkg-config:
pkg-config --modversion protobuf
pkg-config --libs protobuf
Keep these Hadoop paths handy; they are what the RHIPE-related environment variables above should point to:
bin: /usr/local/hadoop/bin
lib: /usr/local/hadoop/lib
conf: /usr/local/hadoop/etc/hadoop/
hadoop: /usr/local/hadoop
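With protobuf visible to pkg-config and the paths above in place, the RHIPE install itself is roughly the following (the tarball name is an assumption; use whichever version you grabbed from the archive above):
sudo R CMD INSTALL Rhipe_0.75.1.tar.gz   # assumed filename
Then a quick check from R:
library(Rhipe)
rhinit()   # should initialize RHIPE against the running Hadoop cluster without errors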
Configuration for installing RHadoop (set these inside R, e.g. in Rprofile.site as above):
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.2.jar")
Download packages here:
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
or find in my downloaded attachments
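Once the dependencies and the environment variables above are in place, a hedged sketch of installing the main RHadoop packages from the downloaded archives and smoke-testing them (the tarball names/versions are assumptions; use the files you actually downloaded):
# CRAN dependencies that rmr2/rhdfs need
Rscript -e 'install.packages(c("rJava","Rcpp","RJSONIO","digest","functional","reshape2","stringr","plyr","caTools","bitops"))'
sudo R CMD INSTALL rhdfs_1.0.8.tar.gz    # assumed filename
sudo R CMD INSTALL rmr2_3.3.1.tar.gz     # assumed filename
Quick test from R (assumes HADOOP_CMD and HADOOP_STREAMING are already set as above):
library(rhdfs)
hdfs.init()                      # attaches to HDFS using $HADOOP_CMD
library(rmr2)
out <- mapreduce(input = to.dfs(1:10),
                 map   = function(k, v) keyval(v, v^2))
from.dfs(out)                    # should return 1:10 paired with their squares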
Install rhbase
1. Build and install Apache Thrift. We recommend that you install on the node containing the HBase Master. See http://thrift.apache.org/ for more details on building and installing Thrift.
2. Install the dependencies for Thrift. At the prompt, type:
sudo apt-get install libboost-all-dev
Important! If installing as non-root, you will need a system administrator to help install these dependencies.
3. Download Thrift 0.8.0: http://archive.apache.org/dist/thrift/0.8.0/thrift-0.8.0.tar.gz
4. Unpack the Thrift archive. At the prompt, type:
tar -xzf thrift-0.8.0.tar.gz
5. Change directory into the versioned Thrift directory. At the prompt, type:
cd thrift-0.8.0
6. Build the Thrift library. We only need the C++ interface of Thrift, so we build without Ruby or Python. At the prompt, type the following two commands:
./configure --without-ruby --without-python
make
7. Install the Thrift library. At the prompt, type:
make install
8. Create a symbolic link to the Thrift library so it can be loaded by the rhbase package. Example of a symbolic link (the /usr/lib64 target comes from the RHEL guide; on Ubuntu use /usr/lib instead):
ln -s /usr/local/lib/libthrift-0.8.0.so /usr/lib64
9. Set up the PKG_CONFIG_PATH environment variable. At the prompt, type:
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
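Before building rhbase, it is worth confirming that pkg-config can now see Thrift (assuming Thrift 0.8.0 put its thrift.pc file under /usr/local/lib/pkgconfig):
pkg-config --modversion thrift     # should print 0.8.0
pkg-config --cflags --libs thrift  # the flags rhbase's build will pick up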
10. Download the rhbase package: https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
11. Install rhbase, only on the node that will run the R client.
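The install itself looks roughly like this (the tarball name is an assumption; use the version you downloaded, and the HBase Thrift server must be running before the test will connect):
sudo R CMD INSTALL rhbase_1.2.1.tar.gz   # assumed filename from the downloads page
Then, from R:
library(rhbase)
hb.init()            # connects to the HBase Thrift server (default localhost:9090)
hb.list.tables()     # lists the HBase tables if the connection works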
source: https://github.com/RevolutionAnalytics/RHadoop/wiki/Installing-RHadoop-on-RHEL
Important: Ensure that libthrift-0.8.0 is in the system library path, which is distinct and separate from the Thrift installation path.
If your "locally installed libraries" are installed in, for example, /usr/local/lib, add this directory to /etc/ld.so.conf (it's a text file) and run "ldconfig"
The command will run a caching utility, but will also create all the necessary "symbolic links" required for the loader system to function. It is surprising that the "make install" for libcurl did not do this already, but it's possible it could not if /usr/local/lib is not in /etc/ld.so.conf already.
PS: it's possible that your /etc/ld.so.conf contains nothing but "include ld.so.conf.d/*.conf". You can still add a directory path after it, or just create a new file inside the directory it's being included from. Dont forget to run "ldconfig" after it.
Be careful. Getting this wrong can screw up your system.
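A concrete sketch of the "new file in ld.so.conf.d" approach (the .conf filename here is arbitrary):
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/usr-local-lib.conf
sudo ldconfig
ldconfig -p | grep thrift   # libthrift-0.8.0.so should now appear in the loader cache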
For further interest, please contact me by email.