https://www.amazon.com/Hadoop-Application-Architectures-Real-World-Applications/dp/1491900083 Book
http://www.johnwittenauer.net/how-to-learn-hadoop-for-free/
https://www.datadoghq.com/blog/hadoop-architecture-overview/
https://www.dezyre.com/article/top-100-hadoop-interview-questions-and-answers-2017/159
https://www.dezyre.com/blog/interview-questions
http://www.themiddlewareshop.com/2017/01/05/creating-a-hadoop-cluster-using-ambari/
https://www.youtube.com/watch?v=U-LRdse5Xms
https://www.youtube.com/watch?v=CmuA9yhCmNY
https://habrahabr.ru/post/319048/
http://haifengl.github.io/bigdata/
http://www.dattamsha.com/2014/09/hadoop-mr-vs-spark-rdd-wordcount-program/
http://www.agildata.com/sql-on-hadoop-the-differences-and-making-the-right-choice/
https://www.youtube.com/watch?v=nbiDOb06qYc
https://greppage.com/evidanary/10 cheatsheet
http://linoxide.com/file-system/hadoop-hdfs-shell-commands/
https://data-flair.training/blogs/spark-interview-questions/
https://data-flair.training/blogs/popular-data-science-interview-questions/
https://data-flair.training/blogs/category/interview-questions/
https://mindmajix.com/hadoop-interview-questions
https://acadgild.com/blog/hadoop-interview-questions/
https://data-flair.training/blogs/top-100-hadoop-interview-questions-and-answers/
https://dzone.com/articles/6-faq-hadoop-interview-questions-amp-answers-with
https://acadgild.com/blog/frequently-asked-hadoop-interview-questions-2017-part-1/
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
https://www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016/
https://www.whizlabs.com/blog/top-50-hadoop-interview-questions/
https://www.guru99.com/hadoop-mapreduce-interview-question.html
https://mycyberuniverse.com/linux/find-and-delete-the-zero-size-files-and-empty-directories.html
The number of mappers is determined implicitly by the split size.
https://hadoopjournal.wordpress.com/2015/06/13/set-mappers-in-pig-hive-and-mapreduce/
http://davidchang168.blogspot.com/2014/01/how-to-change-input-split-size.html
http://www.idryman.org/blog/2014/03/05/hadoop-performance-tuning-best-practices/
http://www.cloudera.com/developers/featured-video.html
https://bigdatauniversity.com/
https://www.youtube.com/watch?v=K14plpZgy_c (data: https://data.sfgov.org)
io.sort.factor
io.sort.mb
mapred.min.split.size
mapred.max.split.size
mapred.min.split.size.per.rack
mapred.min.split.size.per.node
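A minimal sketch of setting these knobs from a driver, assuming the old (pre-YARN) property names listed above; newer releases use mapreduce.input.fileinputformat.split.minsize/maxsize and mapreduce.task.io.sort.mb instead:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Inside a driver method that may throw Exception; the values are examples, not recommendations.
Configuration conf = new Configuration();
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);  // at least 128 MB per split => fewer mappers
conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // at most 256 MB per split
conf.setInt("io.sort.mb", 200);                             // map output sort buffer, in MB
conf.setInt("io.sort.factor", 50);                          // streams merged at once during the sort
Job job = new Job(conf, "tuned-job");                       // splits are computed from these settings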
Each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is the partition ID associated with the reduce task.
Map-Reduce Joins
http://www.slideshare.net/shalishvj/map-reduce-joins-31519757
http://unmeshasreeveni.blogspot.com/2014/12/joining-two-files-using-multipleinput.html
http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf
http://www.edureka.co/blog/map-side-join-vs-join/
http://hadooped.blogspot.com/2013/09/reduce-side-joins-in-java-map-reduce.html
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html
http://codingjunkie.net/mapreduce-reduce-joins/
If you want to merge them:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt
Interview
http://www.java-success.com/01-hadoop-bigdata-overview-interview-questions-answers/
http://www.hadooptpoint.com/category/interview-questions/
http://hadooptutorial.info/hadoop-and-hive-interview-cheat-sheet/
http://hadooptutorial.info/mapreduce-multiple-outputs-use-case/
http://hadooptutorial.info/hadoop-interview-questions-part-1/
http://hadooptutorial.info/hadoop-interview-questions-and-answers-part-2/
http://hadooptutorial.info/hadoop-interview-questions-and-answers-part-3/
http://hadooptutorial.info/hadoop-interview-questions-and-answers-part-4/
http://hadooptutorial.info/hadoop-interview-questions-answers-part-5/
http://hadooptutorial.info/mapreduce-program-to-calculate-missing-count/
File format
http://hadooptutorial.info/hadoop-input-formats/
http://www.jowanza.com/post/158761265324/which-hadoop-file-format-should-i-use
http://hadooptutorial.info/merging-small-files-into-avro-file/
http://hadooptutorial.info/merging-small-files-into-sequencefile/
https://news.ycombinator.com/item?id=13263765
http://hadooptutorial.info/hadoop-output-formats/
http://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html
http://www.baeldung.com/apache-thrift
Parquet
https://habrahabr.ru/company/wrike/blog/279797/
https://habrahabr.ru/post/282552/
Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian (Book)
http://www.hadoopinrealworld.com/hadoopstarterkit/
http://engineering.bloomreach.com/mapreduce-fun-sampling-for-large-data-set/
http://habrahabr.ru/company/dca/blog/270453/
https://habrahabr.ru/post/283242/ HUE
https://habrahabr.ru/post/283138/ BIG DATA and JAVA DIGEST
https://wiki.apache.org/hadoop/Books
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The default split size is the HDFS block size, 128 MB.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill.
“When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner — which buckets keys using a hash function — works very well.”
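As a hedged illustration of such a user-defined partitioning function (new mapreduce API; the class name and the Text/IntWritable types are assumptions for the example):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every key that starts with the same letter to the same reducer, and therefore
// into the same part-r-nnnnn output file. Registered with job.setPartitionerClass(...).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int firstChar = key.getLength() > 0 ? key.charAt(0) : 0;
        return (Character.toLowerCase(firstChar) & Integer.MAX_VALUE) % numPartitions;
    }
}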
ZooKeeper
http://habrahabr.ru/company/yandex/blog/234335/
https://www.lektorium.tv/lecture/14880
https://player.oreilly.com/videos/9781491931028
…systems for the configurations of their clusters. That is precisely the main purpose of ZooKeeper: storing and managing the configurations of particular systems; locks turned out to be a by-product. In the end, the whole system was built so that client code can construct various synchronization primitives. ZooKeeper itself has no explicit notion of things like queues; all of that is implemented on the client-library side.
The protocol used by ZooKeeper is called ZAB. The core of ZooKeeper is a virtual file system made up of interconnected nodes, each of which combines the notions of a file and a directory. Every node of this tree can store data and have child nodes at the same time. In addition, there are two types of nodes: so-called persistent nodes, which are saved to disk and never disappear, and ephemeral nodes, which belong to a particular session and exist only as long as that session does.
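A minimal sketch with the plain ZooKeeper Java client showing the two node types (the connection string and paths are made up for the example):
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Inside a method that may throw Exception:
ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
// Persistent znode: survives the session; a typical place for shared configuration.
zk.create("/config", "some-config".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
// Ephemeral (sequential) znode: vanishes when this session ends; the usual
// client-side building block for locks and queues.
zk.create("/config/lock-", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);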
http://hortonworks.com/tutorials/
http://www.cloudera.com/content/cloudera/en/documentation/HadoopTutorial/CDH5/Hadoop-Tutorial.html
http://habrahabr.ru/company/dca/blog/268277/
http://www.teckstory.com/hadoop-ecosystem/map-reduce-concepts-part-1/
http://www.wiziq.com/blog/31-questions-for-hadoop-developers/
http://www.edureka.co/blog/hadoop-interview-questions-hdfs-2/
http://career.guru99.com/top-20-hadoop-mapreduce-interview-question/
http://yahoohadoop.tumblr.com/
http://www.analyticsvidhya.com/blog/2015/07/big-data-analytics-youtube-ted-resources/
http://www.slideshare.net/it-people/nosql-32925224 NoSQL
https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
https://yadi.sk/d/bw2Z2we4ecGmL/Hadoop/%D0%9B%D0%B5%D0%BA%D1%86%D0%B8%D0%B8
http://www.thoughtworks.com/insights/blog/nosql-databases-overview
http://www.slideshare.net/IvanGlushkov/newsql-overview
http://blog.eviac.net/2015/08/an-introduction-to-yarn.html
http://softwareengineeringdaily.com/2015/08/07/apache-zookeeper-with-flavio-junqueira/
http://habrahabr.ru/post/240405/
http://am.livejournal.com/577957.html?style=mine#cutid1
http://habrahabr.ru/post/223903/
http://www.mapr.com/resources/open-source-projects
http://gethue.com/ Hadoop GUI
http://www.slideshare.net/hortonworks/stinger-initiative-deep-dive
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
http://www.infoq.com/news/2013/02/Stinger
http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/
http://hadapt.com/blog/2013/10/02/classifying-the-sql-on-hadoop-solutions/
http://hadapt.com/blog/2012/12/21/classifying-todays-big-data-innovators/
http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
http://www.ibm.com/developerworks/opensource/library/os-twitterstorm/index.html
http://nerds.airbnb.com/redshift-performance-cost/ (ParAccel on Amazon = Redshift)
Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (Book)
http://www.javacodegeeks.com/2013/07/mapreduce-algorithms-understanding-data-joins-part-1.html
http://www.javacodegeeks.com/2012/11/calculating-a-co-occurrence-matrix-with-hadoop.html
http://www.javacodegeeks.com/2013/09/configuring-hadoop-with-guava-mapsplitters.html
http://www.javacodegeeks.com/2013/09/run-your-hadoop-mapreduce-job-on-amazon-emr.html
Oracle+Hadoop integration
http://cs.yale.edu/homes/xs45/pdf/ss-sigmod2012.pdf
http://www.qubole.com/resources/hive-and-hadoop-tutorial-and-training-resources/
Amazon Elastic map-reduce
http://www.slideshare.net/imcinstitute/big-data-hadoop-using-amazon-elastic-mapreduce-handson-labs
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
http://www.computerra.ru/82659/mapreduce/
Do not use Hadoop if you can avoid it!
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
https://news.ycombinator.com/item?id=6398650
http://www.reddit.com/r/programming/comments/1mkvhs/dont_use_hadoop_your_data_isnt_that_big/
C++ map reduce
http://cdmh.co.uk/papers/software_scalability_mapreduce/library
https://github.com/cdmh/mapreduce
https://code.google.com/p/mapreduce-lite/
http://www.infoq.com/presentations/Introducing-Apache-Hadoop
http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/
http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/
http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/
http://www.ibm.com/developerworks/opensource/library/os-spark/index.html
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
http://mycloudresearch.wordpress.com/2012/03/14/simple-hadoop-overview/
http://www.youtube.com/watch?v=EIS-CcdmLe0
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
http://www.javacodegeeks.com/2012/05/mapreduce-questions-and-answers-part-1.html
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
http://www.infoq.com/articles/HadoopOutputFormat
http://www.sfbayacm.org/introduction-mining-big-data-map-reduce
http://www.manamplified.org/archives/2011/07/common-mapreduce-patterns.html
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
http://research.microsoft.com/en-us/projects/Dryad/
http://habrahabr.ru/post/161437/ HADOOP 2.0 YARN
The input to a MapReduce job is just a set of (input_key,input_value) pairs, which we’ll implement as a Python dictionary. In the wordcount example, the input keys will be the filenames of the files we’re interested in counting words in, and the corresponding input values will be the contents of those files:
filenames = ["a.txt","b.txt","c.txt"]
i = {}
for filename in filenames:
f = open(filename)
i[filename] = f.read()
f.close()
http://pulasthisupun.blogspot.com/2016/06/apache-hadoop-detailed-word-count.html
After this code runs, the Python dictionary i contains the input to our MapReduce job: i has three keys (the filenames) and three corresponding values (the contents of those files).
In the map phase what happens is that for each (input_key,input_value) pair in the input dictionary i, a function mapper(input_key,input_value) is computed, whose output is a list of intermediate keys and values. This function mapper is supplied by the programmer. mapper takes the input key and input value – a filename, and a string containing the contents of the file – and then moves through the words in the file. For each word it encounters, it returns the intermediate key and value (word,1).
A programmer-defined function reducer(intermediate_key,intermediate_value_list) is applied to each entry in the intermediate dictionary. For wordcount, reducer simply sums up the list of intermediate values and returns both the intermediate_key and the sum as the output.
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. As you know from Java itself, the keys of a Map must be comparable, so the key has to implement one more interface, WritableComparable.
Writable Interface: http://developer.yahoo.com/hadoop/tutorial/module5.html So key types must implement a stricter interface, WritableComparable. In addition to being Writable so they can be transmitted over the network, they also obey Java's Comparable interface.
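A hedged sketch of a custom key type following that contract (the class and its fields are hypothetical):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearWordPair implements WritableComparable<YearWordPair> {
    private int year;
    private String word = "";

    public void write(DataOutput out) throws IOException {    // Writable: serialize
        out.writeInt(year);
        out.writeUTF(word);
    }
    public void readFields(DataInput in) throws IOException { // Writable: deserialize
        year = in.readInt();
        word = in.readUTF();
    }
    public int compareTo(YearWordPair o) {                     // Comparable: required for keys
        int cmp = Integer.compare(year, o.year);
        return cmp != 0 ? cmp : word.compareTo(o.word);
    }
    // In practice hashCode()/equals() should also be overridden so the default
    // HashPartitioner assigns equal keys to the same partition.
}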
http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://www.ibm.com/developerworks/linux/library/l-hadoop-3/index.html
http://developer.yahoo.com/hadoop/tutorial/index.html
http://blog.doughellmann.com/2009/04/implementing-mapreduce-with.html
Hadoop 0.22: Ordered Record Collection
http://developer.yahoo.com/blogs/ydn/posts/2010/01/chris_douglas_ordered_record_collection/
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/Partitioner.html
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/record/package-summary.html
Hadoop supports a small set of composite types that enable the description of simple aggregate types and containers. A composite type is serialized by sequentially serializing its constituent elements. The supported composite types are:
* record: An aggregate type like a C-struct. This is a list of typed fields that are together considered a single unit of data. A record is serialized by sequentially serializing its constituent fields. In addition to serialization, a record has comparison operations (equality and less-than) implemented for it; these are defined as memberwise comparisons.
* vector: A sequence of entries of the same data type, primitive or composite.
* map: An associative container mapping instances of a key type to instances of a value type. The key and value types may themselves be primitive or composite types.
Streams
Hadoop generates code for serializing and deserializing record types to abstract streams. For each target language Hadoop defines very simple input and output stream interfaces. Application writers can usually develop concrete implementations of these by putting a one-method wrapper around an existing stream implementation.
DDL Syntax and Examples
We now describe the syntax of the Hadoop data description language. This is followed by a few examples of DDL usage.
Hadoop DDL Syntax
recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record := "class" name "{" 1*(field) "}"
field := type name ";"
name := ALPHA (ALPHA / DIGIT / "_" )*
type := (ptype / ctype)
ptype := ("byte" / "boolean" / "int" /
          "long" / "float" / "double" /
          "ustring" / "buffer")
ctype := ("vector" "<" type ">") /
         ("map" "<" type "," type ">") /
         name
A DDL file describes one or more record types. It begins with zero or more include declarations, a single mandatory module declaration followed by zero or more class declarations. The semantics of each of these declarations are described below:
* include: An include declaration specifies a DDL file to be referenced when generating code for types in the current DDL file. Record types in the current compilation unit may refer to types in all included files. File inclusion is recursive. An include does not trigger code generation for the referenced file.
* module: Every Hadoop DDL file must have a single module declaration that follows the list of includes and precedes all record declarations. A module declaration identifies a scope within which the names of all types in the current file are visible. Module names are mapped to C++ namespaces, Java packages etc. in generated code.
* class: Record types are specified through class declarations. A class declaration is like a Java class declaration. It specifies a named record type and a list of fields that constitute records of the type. Usage is illustrated in the following examples.
Examples
* A simple DDL file links.jr with just one record declaration.
module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
* A DDL file outlinks.jr which includes another
include "links.jr"
module outlinks {
class OutLinks {
ustring baseURL;
vector outLinks;
};
}
Code Generation
The Hadoop translator is written in Java. Invocation is done by executing a wrapper shell script named rcc. It takes a list of record description files as a mandatory argument and an optional language argument, --language or -l (the default is Java). Thus a typical invocation would look like:
$ rcc -l C++ ...
CompositeInputFormat allows joining two large, sorted HDFS files on the map side.
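A hedged sketch of how that map-side join is wired up with the old mapred API (the paths, the "inner" join type, and the driver class are assumptions; both inputs must already be sorted and identically partitioned):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

JobConf conf = new JobConf(MyJoinJob.class);   // MyJoinJob is a placeholder driver class
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr",
         CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                                      new Path("/data/left"), new Path("/data/right")));
// Each map() call then receives the join key plus a TupleWritable holding the
// matching records from both inputs.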
Advanced reading
http://blog.griddynamics.com/2010/07/war-story-optimizing-one-hadoop-job.html
http://www.prohadoopbook.com/
http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/content/counters.html
Merging multiple files into one within Hadoop
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile> (the destination is on the local file system)
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
use the tool HDFSConcat, new in HDFS 0.21
hadoop dfs -count -q : to see quotas and quota remaining.
How to pipeline several MapReduce jobs together (the output of one job goes to the next job)
JobControl is the simplest method for chaining these jobs together (a sketch follows the snippet below).
For more complex workflows, I'd recommend checking out Oozie.
ChainMapper and ChainReducer
Job job1 = new Job(getConf());
job1.waitForCompletion(true);
and then check the status using
if (job1.isSuccessful()) {
    // start another job with a different Mapper
    // change the config
    Job job2 = new Job(getConf());
}
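For comparison, a hedged JobControl sketch (org.apache.hadoop.mapreduce.lib.jobcontrol, available in 0.21+/2.x); job1 and job2 are assumed to be the fully configured Jobs from above:
import java.util.Arrays;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Inside a driver method that may throw Exception:
ControlledJob step1 = new ControlledJob(job1, null);                  // no dependencies
ControlledJob step2 = new ControlledJob(job2, Arrays.asList(step1));  // runs only after step1 succeeds
JobControl control = new JobControl("two-step-chain");
control.addJob(step1);
control.addJob(step2);
new Thread(control).start();          // JobControl implements Runnable
while (!control.allFinished()) {
    Thread.sleep(1000);               // poll until both jobs are done
}
control.stop();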
http://habrahabr.ru/post/195040/ Loading data to HBase
Secondary Sort
http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html#JobControl
http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
http://www.umiacs.umd.edu/~jimmylin/book.html
http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/
http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/
WordCount: is it possible to sort the output by the number of word occurrences?
http://stackoverflow.com/questions/2550784/hadooop-map-reduce
Map/Reduce using Python
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
http://users.livejournal.com/_winnie/301995.html
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://remembersaurus.com/mincemeatpy/
http://discoproject.org/
http://clouddbs.blogspot.com/2010/10/googles-mapreduce-in-98-lines-of-python.html
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html
http://code.google.com/edu/submissions/mapreduce/listing.html
http://binarynerd.com/java-tutorials/distributed-computing/installing-hadoop.html
http://www.higherpass.com/java/Tutorials/Building-Hadoop-Mapreduce-Jobs-In-Java/
http://www.nd.edu/~ccl/operations/hadoop/
http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
Hadoop
http://hadoop.apache.org/common/releases.html
http://developer.yahoo.com/hadoop/tutorial/
http://www.jakobhoman.com/2007/09/useful-hadoop-resources.html
Code
hadoop jar hadoop-examples.jar pi 4 10000
ant -Dcompile.c++=yes compile-c++-examples
http://hadoop.apache.org/mapreduce/releases.html#Download
http://marionote.wordpress.com/2010/06/04/hadoop-installation-note-standalone/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://binarynerd.com/java-tutorials/distributed-computing/installing-hadoop.html
Hadoop on Windows:
http://karmasphere.com/Studio-Eclipse/installation.html
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
http://www.infosci.cornell.edu/hadoop/windows.html
http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
http://vorlsblog.blogspot.com/2010/05/running-hadoop-on-windows-without.html
http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/
Eclipse
Subversion Plugin http://marionote.wordpress.com/2010/01/17/installing-subversive-eclipse3-5/
http://wiki.apache.org/hadoop/EclipseEnvironment
http://code.google.com/edu/parallel/tools/hadoopvm/index.html
Windows7 -64 bits: http://archive.eclipse.org/eclipse/downloads/drops/R-3.5-200906111540/download.php?dropFile=eclipse-SDK-3.5-win32-x86_64.zip
ssh-host-config on Windows7
http://www.kgx.net.nz/2010/03/cygwin-sshd-and-windows-7/
http://chinese-watercolor.com/LRP/printsrv/vista-cygwin.txt
set the following environment variables:
export JAVA_HOME=...
export HADOOP_HOME=...
export PATH=${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
Make a private directory and upload a file:
hadoop version
hadoop fs -ls /
hadoop fs -mkdir /YOURNAME
hadoop fs -put /usr/share/dict/linux.words /YOURNAME/words
hadoop fs -ls /YOURNAME
hadoop fs -cat /YOURNAME/words | less
Download WordCount.java to your machine and compile it into wordcount.jar as follows:
mkdir wordcount_classes
javac -classpath ${HADOOP_HOME}/hadoop-*-core.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes .
To perform a Map-Reduce job, run hadoop with the jar option and specify the input file and a new directory for output files:
hadoop jar wordcount.jar WordCount /public/warandpeace.txt /YOURNAME/outputs
Now, your outputs are stored under /YOURNAME/outputs in Hadoop:
hadoop fs -ls /YOURNAME/outputs
hadoop fs -cat /YOURNAME/outputs/part-00000
Changes from Hadoop 0.18 to 0.20 http://blog.data-miners.com/
The updated code with the Hadoop 0.20 API is in RowNumberTwoPass-0.20.java.
Before 0.20, Hadoop used classes in a package called "mapred". Starting with 0.20, it uses classes in "mapreduce", which have a different interface. The reason for this change has to do with future development of Hadoop: it will make it possible to separate releases of HDFS (the distributed file system) from releases of MapReduce. The following packages contain the new interface:
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.map.*;
import org.apache.hadoop.mapreduce.lib.reduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
The following changes apply to the Map and Reduce classes:
The classes no longer need the "implements" syntax.
The function called before the map/reduce is now called setup() rather than configure().
The function called after the map/reduce is called cleanup().
The functions all take an argument whose class is Context; this is used instead of Reporter and OutputCollector.
The map and reduce functions can also throw InterruptedException.
The driver function has more changes, caused by the fact that JobConf is no longer part of the interface. Instead, the work is set up using Job. Variables and values are passed into the Map and Reduce classes through the Configuration (obtained from the Context) rather than JobConf, and the code for the Map and Reduce classes is added in using the call Job.setJarByClass().
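A hedged sketch of a mapper written against the 0.20 "mapreduce" API, illustrating the points above (the class itself is hypothetical); it would be registered with job.setMapperClass(TokenCountMapper.class) and packaged via job.setJarByClass(...):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void setup(Context context) {               // replaces configure()
        // read job parameters from context.getConfiguration() if needed
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {    // note InterruptedException
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                token.set(word);
                context.write(token, one);                 // Context replaces OutputCollector
            }
        }
    }

    @Override
    protected void cleanup(Context context) {              // called once after the last record
    }
}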
Implement the Hadoop Tool interface
http://grepalex.com/2013/02/25/hadoop-libjars/
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html
* -D to pass in arbitrary Hadoop job properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7; note the space after -D)
* -files to put files into the distributed cache
* -archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
* -libjars to put JAR files on the task classpath
public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), MyJob.class);
        // run job ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
http://cxwangyi.blogspot.com/2009/12/wordcount-tutorial-for-hadoop-0201.html
Hadoop C++
http://wiki.apache.org/hadoop/C%2B%2BWordCount
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop
http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
Makefile
CC = g++
HADOOP_INSTALL=/home/y/libexec/hadoop
#PLATFORM = Linux-amd64-64
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
wordcount2: wordcount.cpp
$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
-lhadooputils -lpthread -g -O2 -o $@
debugging hadoop: http://habrahabr.ru/blogs/hi/89365/#habracut
The NameNode stores all information about the file system namespace in a file called FsImage. This file, along with a record of all transactions (referred to as the EditLog), is stored on the local file system of the NameNode. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system itself. The NameNode relies on periodic heartbeat messages from each DataNode to confirm that the node is alive and that its block replicas are available.
http://blip.tv/search?q=hadoop
Sequence File
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
cascading.org
http://www.cascading.org/ http://www.karmasphere.com/
http://gigaom.com/cloud/twitter-to-open-source-hadoop-like-tool/ Real-Time Stream Processing
Lucene and Solr
Solr is a high performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
http://lucene.apache.org/
High-Scale Architectures
http://www.royans.net/arch/library/
http://www.jiahenglu.net/course/advancedDataManagement/
http://horicky.blogspot.com/2010/10/scalable-system-design-patterns.html
http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html
Pregel: A System for Large-Scale Graph Processing
http://www.royans.net/arch/pregel-googles-other-data-processing-infrastructure/