Initial setup
Change the hadoop user's home directory to /usr/local/hadoop.
$vim /etc/passwd
hadoop:x:1002:1002:hadoop,,,:/usr/local/hadoop:/bin/bash
※Log out and log back in for the change to take effect.
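Editing /etc/passwd by hand works, but usermod does the same thing more safely (run it as root while the hadoop user is logged out):
$sudo usermod -d /usr/local/hadoop hadoop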
Edit .bashrc:
$vim .bashrc
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export PATH=$PATH:/usr/local/hadoop/bin
$source .bashrc
Alternatively, set them in conf/hadoop-env.sh:
$vim ~/hadoop-0.21.0/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_PID_DIR=/var/hadoop/pids
$sudo mkdir -p /var/hadoop/pids
$sudo chmod 777 /var/hadoop/pids
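Either way, confirm the hadoop command is now on the PATH (an optional check):
$which hadoop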
Format the HDFS filesystem and start Hadoop
$bin/hadoop namenode -format
$bin/start-all.sh
$jps
5178 TaskTracker
5005 JobTracker
4915 SecondaryNameNode
4734 DataNode
6506 Jps
4534 NameNode
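All five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) are running. As a further optional sanity check, HDFS should answer a listing:
$bin/hadoop dfs -ls /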
Directory layout
Create the following directories and files.
/usr/local/hadoop/input/example.tsv
/usr/local/hadoop/work/python/map.py
/usr/local/hadoop/work/python/reduce.py
The file contents are as follows.
example.tsv
1 test
2 mochi
3 aaaa
4 aaaa
5 test
6 bbbbb
7 test
8 mochi
9 hagaeru3sei
10 hagaeru3sei
11 test
map.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def main():
    while 1:
        line = sys.stdin.readline()
        if not line:
            break
        line = line[:-1]                        # strip the trailing \n
        fields = line.split(" ")                # the sample file is space-delimited
        print "%s %s" % (fields[0], fields[1])

if __name__ == "__main__":
    main()
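Note that map.py assumes every line has at least two space-separated fields and will raise IndexError otherwise. A slightly more defensive variant (a hypothetical sketch, not the original) simply skips malformed lines:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def main():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 2:        # malformed line: skip it
            continue
        print "%s %s" % (fields[0], fields[1])

if __name__ == "__main__":
    main()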
reduce.py
#!/usr/bin/env python
# coding: utf-8
import sys

cnt = {}

def main():
    global cnt
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]                    # del \n
            key, value = line.split(" ")
            if not cnt.has_key(value):
                cnt[value] = 0
            cnt[value] += 1                     # tally occurrences of each word
            line = sys.stdin.readline()
    except Exception, e:
        print(e)

if __name__ == "__main__":
    main()
    for k, v in cnt.iteritems():                # dump the totals after EOF
        print "[ "+ str(k) +" ]\t:\t"+ str(v)
Local test results
$ python map.py < /usr/local/hadoop/input/example.tsv
1 test
2 mochi
3 aaaa
4 aaaa
5 test
6 bbbbb
7 test
8 mochi
9 hagaeru3sei
10 hagaeru3sei
11 test
Feeding the raw file straight to reduce.py also works, since map.py merely echoes its two fields:
$ python reduce.py < /usr/local/hadoop/input/example.tsv
[ test ] : 4
[ aaaa ] : 2
[ bbbbb ] : 1
[ hagaeru3sei ] : 2
[ mochi ] : 2
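Hadoop streaming sorts the map output before it reaches the reducer; you can mimic the full pipeline locally with a pipe through sort (with this dict-based reduce.py the totals come out the same either way):
$ python map.py < /usr/local/hadoop/input/example.tsv | sort | python reduce.py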
Now let's actually load it into HDFS
$hadoop dfs -copyFromLocal input/example.tsv input/example.tsv
$hadoop dfs -lsr --- (check)
drwxr-xr-x - hadoop supergroup 0 2010-05-08 23:59 /user/hadoop/input
-rw-r--r-- 1 hadoop supergroup 96 2010-05-08 23:59 /user/hadoop/input/example.tsv
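To double-check the uploaded contents (an optional step):
$hadoop dfs -cat input/example.tsv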
Run the Hadoop job
With Hadoop 0.20.2:
$hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -mapper /usr/local/hadoop/work/python/map.py -reducer /usr/local/hadoop/work/python/reduce.py -input input -output output
With 0.21.0, the streaming jar lives under mapred/contrib instead:
$./bin/hadoop jar mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -mapper /usr/local/hadoop/work/python/map.py -reducer /usr/local/hadoop/work/python/reduce.py -input input -output output
packageJobJar: [/usr/local/hadoop-datastore/hadoop-hadoop/hadoop-unjar5391497311484448401/] [] /tmp/streamjob8371643421601457742.jar tmpDir=null
10/05/09 00:16:30 ERROR streaming.StreamJob: Error launching job , Output path already exists : Output directory hdfs://localhost:54310/user/hadoop/output already exists
Streaming Job Failed!
The job failed because output from a previous run was still there, so delete it.
$hadoop dfs -rmr output
$hadoop dfs -lsr
drwxr-xr-x - hadoop supergroup 0 2010-05-09 00:26 /user/hadoop/input
-rw-r--r-- 1 hadoop supergroup 96 2010-05-08 23:59 /user/hadoop/input/example.tsv
Run it again
$hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -mapper /usr/local/hadoop/work/python/map.py -reducer /usr/local/hadoop/work/python/reduce.py -input input -output output
packageJobJar: [/usr/local/hadoop-datastore/hadoop-hadoop/hadoop-unjar2080240995351251437/] [] /tmp/streamjob5088129496510443540.jar tmpDir=null
10/05/09 00:26:43 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/09 00:26:43 INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop-datastore/hadoop-hadoop/mapred/local]
10/05/09 00:26:43 INFO streaming.StreamJob: Running job: job_201005082331_0004
10/05/09 00:26:43 INFO streaming.StreamJob: To kill this job, run:
10/05/09 00:26:43 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201005082331_0004
10/05/09 00:26:43 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201005082331_0004
10/05/09 00:26:44 INFO streaming.StreamJob: map 0% reduce 0%
10/05/09 00:26:52 INFO streaming.StreamJob: map 100% reduce 0%
10/05/09 00:27:04 INFO streaming.StreamJob: map 100% reduce 100%
10/05/09 00:27:07 INFO streaming.StreamJob: Job complete: job_201005082331_0004
10/05/09 00:27:07 INFO streaming.StreamJob: Output: output
It succeeded! Let's look at the result.
$hadoop dfs -lsr
drwxr-xr-x - hadoop supergroup 0 2010-05-09 00:26 /user/hadoop/input
-rw-r--r-- 1 hadoop supergroup 96 2010-05-08 23:59 /user/hadoop/input/example.tsv
drwxr-xr-x - hadoop supergroup 0 2010-05-09 00:27 /user/hadoop/output
drwxr-xr-x - hadoop supergroup 0 2010-05-09 00:26 /user/hadoop/output/_logs
drwxr-xr-x - hadoop supergroup 0 2010-05-09 00:26 /user/hadoop/output/_logs/history
-rw-r--r-- 1 hadoop supergroup 17327 2010-05-09 00:26 /user/hadoop/output/_logs/history/localhost_1273329121967_job_201005082331_0004_conf.xml
-rw-r--r-- 1 hadoop supergroup 9059 2010-05-09 00:26 /user/hadoop/output/_logs/history/localhost_1273329121967_job_201005082331_0004_hadoop_streamjob5088129496510443540.jar
-rw-r--r-- 1 hadoop supergroup 79 2010-05-09 00:26 /user/hadoop/output/part-00000
The data is in the output directory as expected. Let's look at the contents.
$hadoop dfs -cat /user/hadoop/output/part-00000
[ test ] : 4
[ mochi ] : 2
[ hagaeru3sei ] : 2
[ aaaa ] : 2
[ bbbbb ] : 1
It works! (The ordering differs from the local run because the reducer's dict iteration order is arbitrary.)
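A side note for multi-node clusters: the absolute -mapper/-reducer paths above only work because everything runs on a single node. Streaming's -file option ships the scripts with the job so they can be referenced by bare name (a sketch of the 0.20.2 invocation):
$hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file /usr/local/hadoop/work/python/map.py -file /usr/local/hadoop/work/python/reduce.py -mapper map.py -reducer reduce.py -input input -output output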
Retrieving files from HDFS
Use the copyToLocal command:
./bin/hadoop dfs -copyToLocal output/part-00000 .
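If the job had run with multiple reducers there would be several part-XXXXX files; -getmerge concatenates them into a single local file (result.txt is an arbitrary local name):
$./bin/hadoop dfs -getmerge output result.txt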