https://www.amazon.com/Hadoop-Application-Architectures-Real-World-Applications/dp/1491900083 Book
http://www.johnwittenauer.net/how-to-learn-hadoop-for-free/
https://www.datadoghq.com/blog/hadoop-architecture-overview/
https://www.dezyre.com/article/top-100-hadoop-interview-questions-and-answers-2017/159
https://www.dezyre.com/blog/interview-questions
http://www.themiddlewareshop.com/2017/01/05/creating-a-hadoop-cluster-using-ambari/
https://www.youtube.com/watch?v=U-LRdse5Xms
https://www.youtube.com/watch?v=CmuA9yhCmNY
https://habrahabr.ru/post/319048/
http://haifengl.github.io/bigdata/
http://www.dattamsha.com/2014/09/hadoop-mr-vs-spark-rdd-wordcount-program/
http://www.agildata.com/sql-on-hadoop-the-differences-and-making-the-right-choice/
https://www.youtube.com/watch?v=nbiDOb06qYc
https://greppage.com/evidanary/10 cheatsheet
http://linoxide.com/file-system/hadoop-hdfs-shell-commands/
https://data-flair.training/blogs/spark-interview-questions/
https://data-flair.training/blogs/popular-data-science-interview-questions/
https://data-flair.training/blogs/category/interview-questions/
https://mindmajix.com/hadoop-interview-questions
https://acadgild.com/blog/hadoop-interview-questions/
https://data-flair.training/blogs/top-100-hadoop-interview-questions-and-answers/
https://dzone.com/articles/6-faq-hadoop-interview-questions-amp-answers-with
https://acadgild.com/blog/frequently-asked-hadoop-interview-questions-2017-part-1/
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
https://www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016/
https://www.whizlabs.com/blog/top-50-hadoop-interview-questions/
https://www.guru99.com/hadoop-mapreduce-interview-question.html
https://mycyberuniverse.com/linux/find-and-delete-the-zero-size-files-and-empty-directories.html
The number of mappers is determined implicitly by the split size.
https://hadoopjournal.wordpress.com/2015/06/13/set-mappers-in-pig-hive-and-mapreduce/
http://davidchang168.blogspot.com/2014/01/how-to-change-input-split-size.html
http://www.idryman.org/blog/2014/03/05/hadoop-performance-tuning-best-practices/
http://www.cloudera.com/developers/featured-video.html
https://bigdatauniversity.com/
https://www.youtube.com/watch?v=K14plpZgy_c (data: https://data.sfgov.org)
io.sort.factor
io.sort.mb
mapred.min.split.size
mapred.max.split.size
mapred.min.split.size.per.rack
mapred.min.split.size.per.node
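A minimal sketch of setting these knobs from a driver, assuming the old (pre-YARN) property names listed above; newer releases use mapreduce.input.fileinputformat.split.minsize/maxsize and mapreduce.task.io.sort.mb instead:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Inside a driver method that may throw Exception; the values are examples, not recommendations.
Configuration conf = new Configuration();
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);  // at least 128 MB per split => fewer mappers
conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // at most 256 MB per split
conf.setInt("io.sort.mb", 200);                             // map output sort buffer, in MB
conf.setInt("io.sort.factor", 50);                          // streams merged at once during the sort
Job job = new Job(conf, "tuned-job");                       // splits are computed from these settings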
Each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is the partition ID associated with the reduce task.
Map-Reduce Joins
http://www.slideshare.net/shalishvj/map-reduce-joins-31519757
http://unmeshasreeveni.blogspot.com/2014/12/joining-two-files-using-multipleinput.html
http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf
http://www.edureka.co/blog/map-side-join-vs-join/
http://hadooped.blogspot.com/2013/09/reduce-side-joins-in-java-map-reduce.html
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html
http://codingjunkie.net/mapreduce-reduce-joins/
If you want to merge them:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt
Interview
http://www.java-success.com/01-hadoop-bigdata-overview-interview-questions-answers/
http://www.hadooptpoint.com/category/interview-questions/
http://hadooptutorial.info/hadoop-and-hive-interview-cheat-sheet/
http://hadooptutorial.info/mapreduce-multiple-outputs-use-case/
http://hadooptutorial.info/hadoop-interview-questions-part-1/
http://hadooptutorial.info/hadoop-interview-questions-and-answers-part-2/
http://hadooptutorial.info/hadoop-interview-questions-and-answers-part-3/
http://hadooptutorial.info/hadoop-interview-questions-and-answers-part-4/
http://hadooptutorial.info/hadoop-interview-questions-answers-part-5/
http://hadooptutorial.info/mapreduce-program-to-calculate-missing-count/
File format
http://hadooptutorial.info/hadoop-input-formats/
http://www.jowanza.com/post/158761265324/which-hadoop-file-format-should-i-use
http://hadooptutorial.info/merging-small-files-into-avro-file/
http://hadooptutorial.info/merging-small-files-into-sequencefile/
https://news.ycombinator.com/item?id=13263765
http://hadooptutorial.info/hadoop-output-formats/
http://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html
http://www.baeldung.com/apache-thrift
Parquet
https://habrahabr.ru/company/wrike/blog/279797/
https://habrahabr.ru/post/282552/
Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian (Book)
http://www.hadoopinrealworld.com/hadoopstarterkit/
http://engineering.bloomreach.com/mapreduce-fun-sampling-for-large-data-set/
http://habrahabr.ru/company/dca/blog/270453/
https://habrahabr.ru/post/283242/ HUE
https://habrahabr.ru/post/283138/ BIG DATA and JAVA DIGEST
https://wiki.apache.org/hadoop/Books
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The default split size is the HDFS block size, 128 MB.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill.
“When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner — which buckets keys using a hash function — works very well.”
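As a hedged illustration of such a user-defined partitioning function (new mapreduce API; the class name and the Text/IntWritable types are assumptions for the example):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every key that starts with the same letter to the same reducer, and therefore
// into the same part-r-nnnnn output file. Registered with job.setPartitionerClass(...).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int firstChar = key.getLength() > 0 ? key.charAt(0) : 0;
        return (Character.toLowerCase(firstChar) & Integer.MAX_VALUE) % numPartitions;
    }
}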
ZooKeeper
http://habrahabr.ru/company/yandex/blog/234335/
https://www.lektorium.tv/lecture/14880
https://player.oreilly.com/videos/9781491931028
…systems for the configurations of their clusters. That is precisely the main purpose of ZooKeeper: storing and managing the configurations of particular systems; locks turned out to be a by-product. In the end, the whole system was built so that client code can construct various synchronization primitives. ZooKeeper itself has no explicit notion of things like queues; all of that is implemented on the client-library side.
The protocol used by ZooKeeper is called ZAB. The core of ZooKeeper is a virtual file system made up of interconnected nodes, each of which combines the notions of a file and a directory. Every node of this tree can store data and have child nodes at the same time. In addition, there are two types of nodes: so-called persistent nodes, which are saved to disk and never disappear, and ephemeral nodes, which belong to a particular session and exist only as long as that session does.
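A minimal sketch with the plain ZooKeeper Java client showing the two node types (the connection string and paths are made up for the example):
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Inside a method that may throw Exception:
ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
// Persistent znode: survives the session; a typical place for shared configuration.
zk.create("/config", "some-config".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
// Ephemeral (sequential) znode: vanishes when this session ends; the usual
// client-side building block for locks and queues.
zk.create("/config/lock-", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);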
http://hortonworks.com/tutorials/
http://www.cloudera.com/content/cloudera/en/documentation/HadoopTutorial/CDH5/Hadoop-Tutorial.html
http://habrahabr.ru/company/dca/blog/268277/
http://www.teckstory.com/hadoop-ecosystem/map-reduce-concepts-part-1/
http://www.wiziq.com/blog/31-questions-for-hadoop-developers/
http://www.edureka.co/blog/hadoop-interview-questions-hdfs-2/
http://career.guru99.com/top-20-hadoop-mapreduce-interview-question/
http://yahoohadoop.tumblr.com/
http://www.analyticsvidhya.com/blog/2015/07/big-data-analytics-youtube-ted-resources/
http://www.slideshare.net/it-people/nosql-32925224 NoSQL
https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
https://yadi.sk/d/bw2Z2we4ecGmL/Hadoop/%D0%9B%D0%B5%D0%BA%D1%86%D0%B8%D0%B8
http://www.thoughtworks.com/insights/blog/nosql-databases-overview
http://www.slideshare.net/IvanGlushkov/newsql-overview
http://blog.eviac.net/2015/08/an-introduction-to-yarn.html
http://softwareengineeringdaily.com/2015/08/07/apache-zookeeper-with-flavio-junqueira/
http://habrahabr.ru/post/240405/
http://am.livejournal.com/577957.html?style=mine#cutid1
http://habrahabr.ru/post/223903/
http://www.mapr.com/resources/open-source-projects
http://gethue.com/ Hadoop GUI
http://www.slideshare.net/hortonworks/stinger-initiative-deep-dive
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
http://www.infoq.com/news/2013/02/Stinger
http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/
http://hadapt.com/blog/2013/10/02/classifying-the-sql-on-hadoop-solutions/
http://hadapt.com/blog/2012/12/21/classifying-todays-big-data-innovators/
http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
http://www.ibm.com/developerworks/opensource/library/os-twitterstorm/index.html
http://nerds.airbnb.com/redshift-performance-cost/ (ParAccel on Amazon = Redshift)
Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (Book)
http://www.javacodegeeks.com/2013/07/mapreduce-algorithms-understanding-data-joins-part-1.html
http://www.javacodegeeks.com/2012/11/calculating-a-co-occurrence-matrix-with-hadoop.html
http://www.javacodegeeks.com/2013/09/configuring-hadoop-with-guava-mapsplitters.html
http://www.javacodegeeks.com/2013/09/run-your-hadoop-mapreduce-job-on-amazon-emr.html
Oracle+Hadoop integration
http://cs.yale.edu/homes/xs45/pdf/ss-sigmod2012.pdf
http://www.qubole.com/resources/hive-and-hadoop-tutorial-and-training-resources/
Amazon Elastic map-reduce
http://www.slideshare.net/imcinstitute/big-data-hadoop-using-amazon-elastic-mapreduce-handson-labs
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
http://www.computerra.ru/82659/mapreduce/
Do not use Hadoop if you can avoid it!
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
https://news.ycombinator.com/item?id=6398650
http://www.reddit.com/r/programming/comments/1mkvhs/dont_use_hadoop_your_data_isnt_that_big/
C++ map reduce
http://cdmh.co.uk/papers/software_scalability_mapreduce/library
https://github.com/cdmh/mapreduce
https://code.google.com/p/mapreduce-lite/
http://www.infoq.com/presentations/Introducing-Apache-Hadoop
http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/
http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/
http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/
http://www.ibm.com/developerworks/opensource/library/os-spark/index.html
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
http://mycloudresearch.wordpress.com/2012/03/14/simple-hadoop-overview/
http://www.youtube.com/watch?v=EIS-CcdmLe0
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
http://www.javacodegeeks.com/2012/05/mapreduce-questions-and-answers-part-1.html
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
http://www.infoq.com/articles/HadoopOutputFormat
http://www.sfbayacm.org/introduction-mining-big-data-map-reduce
http://www.manamplified.org/archives/2011/07/common-mapreduce-patterns.html
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
http://research.microsoft.com/en-us/projects/Dryad/
http://habrahabr.ru/post/161437/ HADOOP 2.0 YARN
The input to a MapReduce job is just a set of (input_key,input_value) pairs, which we’ll implement as a Python dictionary. In the wordcount example, the input keys will be the filenames of the files we’re interested in counting words in, and the corresponding input values will be the contents of those files:
filenames = ["a.txt","b.txt","c.txt"]
i = {}
for filename in filenames:
f = open(filename)
i[filename] = f.read()
f.close()
http://pulasthisupun.blogspot.com/2016/06/apache-hadoop-detailed-word-count.html
After this code runs, the Python dictionary i contains the input to our MapReduce job: i has three keys (the filenames) and three corresponding values (the contents of those files).
In the map phase what happens is that for each (input_key,input_value) pair in the input dictionary i, a function mapper(input_key,input_value) is computed, whose output is a list of intermediate keys and values. This function mapper is supplied by the programmer. mapper takes the input key and input value – a filename, and a string containing the contents of the file – and then moves through the words in the file. For each word it encounters, it returns the intermediate key and value (word,1).
A programmer-defined function reducer(intermediate_key,intermediate_value_list) is applied to each entry in the intermediate dictionary. For wordcount, reducer simply sums up the list of intermediate values and returns both the intermediate_key and the sum as the output.
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. As you know from Java itself, the keys of a Map must be comparable, so the key has to implement one more interface, WritableComparable.
Writable Interface: http://developer.yahoo.com/hadoop/tutorial/module5.html So key types must implement a stricter interface, WritableComparable. In addition to being Writable so they can be transmitted over the network, they also obey Java's Comparable interface.
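A hedged sketch of a custom key type following that contract (the class and its fields are hypothetical):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearWordPair implements WritableComparable<YearWordPair> {
    private int year;
    private String word = "";

    public void write(DataOutput out) throws IOException {    // Writable: serialize
        out.writeInt(year);
        out.writeUTF(word);
    }
    public void readFields(DataInput in) throws IOException { // Writable: deserialize
        year = in.readInt();
        word = in.readUTF();
    }
    public int compareTo(YearWordPair o) {                     // Comparable: required for keys
        int cmp = Integer.compare(year, o.year);
        return cmp != 0 ? cmp : word.compareTo(o.word);
    }
    // In practice hashCode()/equals() should also be overridden so the default
    // HashPartitioner assigns equal keys to the same partition.
}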
http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://www.ibm.com/developerworks/linux/library/l-hadoop-3/index.html
http://developer.yahoo.com/hadoop/tutorial/index.html
http://blog.doughellmann.com/2009/04/implementing-mapreduce-with.html
Hadoop 0.22: Ordered Record Collection
http://developer.yahoo.com/blogs/ydn/posts/2010/01/chris_douglas_ordered_record_collection/
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/Partitioner.html
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/record/package-summary.html
Hadoop supports a small set of composite types that enable the description of simple aggregate types and containers. A composite type is serialized by sequentially serializing its constituent elements. The supported composite types are:
* record: An aggregate type like a C-struct. This is a list of typed fields that are together considered a single unit of data. A record is serialized by sequentially serializing its constituent fields. In addition to serialization, a record has comparison operations (equality and less-than) implemented for it; these are defined as memberwise comparisons.
* vector: A sequence of entries of the same data type, primitive or composite.
* map: An associative container mapping instances of a key type to instances of a value type. The key and value types may themselves be primitive or composite types.
Streams
Hadoop generates code for serializing and deserializing record types to abstract streams. For each target language Hadoop defines very simple input and output stream interfaces. Application writers can usually develop concrete implementations of these by putting a one-method wrapper around an existing stream implementation.
DDL Syntax and Examples
We now describe the syntax of the Hadoop data description language. This is followed by a few examples of DDL usage.
Hadoop DDL Syntax
recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record := "class" name "{" 1*(field) "}"
field := type name ";"
name := ALPHA (ALPHA / DIGIT / "_" )*
type := (ptype / ctype)
ptype := ("byte" / "boolean" / "int" /
          "long" / "float" / "double" /
          "ustring" / "buffer")
ctype := ("vector" "<" type ">") /
         ("map" "<" type "," type ">") /
         name
A DDL file describes one or more record types. It begins with zero or more include declarations, a single mandatory module declaration followed by zero or more class declarations. The semantics of each of these declarations are described below:
* include: An include declaration specifies a DDL file to be referenced when generating code for types in the current DDL file. Record types in the current compilation unit may refer to types in all included files. File inclusion is recursive. An include does not trigger code generation for the referenced file.
* module: Every Hadoop DDL file must have a single module declaration that follows the list of includes and precedes all record declarations. A module declaration identifies a scope within which the names of all types in the current file are visible. Module names are mapped to C++ namespaces, Java packages etc. in generated code.
* class: Record types are specified through class declarations. A class declaration is like a Java class declaration. It specifies a named record type and a list of fields that constitute records of the type. Usage is illustrated in the following examples.
Examples
* A simple DDL file links.jr with just one record declaration.
module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
* A DDL file outlinks.jr which includes another
include "links.jr"
module outlinks {
class OutLinks {
ustring baseURL;
vector outLinks;
};
}
Code Generation
The Hadoop translator is written in Java. Invocation is done by executing a wrapper shell script named rcc. It takes a list of record description files as a mandatory argument and an optional language argument, --language or -l (the default is Java). Thus a typical invocation would look like:
$ rcc -l C++ ...
CompositeInputFormat allows joining two large, sorted HDFS files on the map side.
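A hedged sketch of how that map-side join is wired up with the old mapred API (the paths, the "inner" join type, and the driver class are assumptions; both inputs must already be sorted and identically partitioned):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

JobConf conf = new JobConf(MyJoinJob.class);   // MyJoinJob is a placeholder driver class
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr",
         CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                                      new Path("/data/left"), new Path("/data/right")));
// Each map() call then receives the join key plus a TupleWritable holding the
// matching records from both inputs.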
Advanced reading
http://blog.griddynamics.com/2010/07/war-story-optimizing-one-hadoop-job.html
http://www.prohadoopbook.com/
http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/content/counters.html
Merging multiple files into one within Hadoop
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile> (the destination is on the local file system)
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
use the tool HDFSConcat, new in HDFS 0.21
hadoop dfs -count -q : to see quotas and quota remaining.
How to pipeline several MapReduce jobs together (the output of one job goes to the next job)
JobControl is the simplest method for chaining these jobs together (a sketch follows the snippet below).
For more complex workflows, I'd recommend checking out Oozie.
ChainMapper and ChainReducer
Job job1 = new Job(getConf());
job1.waitForCompletion(true);
and then check the status using
if (job1.isSuccessful()) {
    // start another job with a different Mapper
    // change the config
    Job job2 = new Job(getConf());
}
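For comparison, a hedged JobControl sketch (org.apache.hadoop.mapreduce.lib.jobcontrol, available in 0.21+/2.x); job1 and job2 are assumed to be the fully configured Jobs from above:
import java.util.Arrays;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Inside a driver method that may throw Exception:
ControlledJob step1 = new ControlledJob(job1, null);                  // no dependencies
ControlledJob step2 = new ControlledJob(job2, Arrays.asList(step1));  // runs only after step1 succeeds
JobControl control = new JobControl("two-step-chain");
control.addJob(step1);
control.addJob(step2);
new Thread(control).start();          // JobControl implements Runnable
while (!control.allFinished()) {
    Thread.sleep(1000);               // poll until both jobs are done
}
control.stop();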
http://habrahabr.ru/post/195040/ Loading data to HBase
Secondary Sort
http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html#JobControl
http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
http://www.umiacs.umd.edu/~jimmylin/book.html
http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/
http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/
WordCount: is it possible to sort the output by the number of word occurrences?
http://stackoverflow.com/questions/2550784/hadooop-map-reduce
Map/Reduce using Python
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
http://users.livejournal.com/_winnie/301995.html
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://remembersaurus.com/mincemeatpy/
http://discoproject.org/
http://clouddbs.blogspot.com/2010/10/googles-mapreduce-in-98-lines-of-python.html
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html
http://code.google.com/edu/submissions/mapreduce/listing.html
http://binarynerd.com/java-tutorials/distributed-computing/installing-hadoop.html
http://www.higherpass.com/java/Tutorials/Building-Hadoop-Mapreduce-Jobs-In-Java/
http://www.nd.edu/~ccl/operations/hadoop/
http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
Hadoop
http://hadoop.apache.org/common/releases.html
http://developer.yahoo.com/hadoop/tutorial/
http://www.jakobhoman.com/2007/09/useful-hadoop-resources.html
Code
hadoop jar hadoop-examples.jar pi 4 10000
ant -Dcompile.c++=yes compile-c++-examples
http://hadoop.apache.org/mapreduce/releases.html#Download
http://marionote.wordpress.com/2010/06/04/hadoop-installation-note-standalone/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://binarynerd.com/java-tutorials/distributed-computing/installing-hadoop.html
Hadoop on Windows:
http://karmasphere.com/Studio-Eclipse/installation.html
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
http://www.infosci.cornell.edu/hadoop/windows.html
http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
http://vorlsblog.blogspot.com/2010/05/running-hadoop-on-windows-without.html
http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/
Eclipse
Subversion Plugin http://marionote.wordpress.com/2010/01/17/installing-subversive-eclipse3-5/
http://wiki.apache.org/hadoop/EclipseEnvironment
http://code.google.com/edu/parallel/tools/hadoopvm/index.html
Windows7 -64 bits: http://archive.eclipse.org/eclipse/downloads/drops/R-3.5-200906111540/download.php?dropFile=eclipse-SDK-3.5-win32-x86_64.zip
ssh-host-config on Windows7
http://www.kgx.net.nz/2010/03/cygwin-sshd-and-windows-7/
http://chinese-watercolor.com/LRP/printsrv/vista-cygwin.txt
set the following environment variables:
export JAVA_HOME=...
export HADOOP_HOME=...
export PATH=${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
Make a private directory and upload a file:
hadoop version
hadoop fs -ls /
hadoop fs -mkdir /YOURNAME
hadoop fs -put /usr/share/dict/linux.words /YOURNAME/words
hadoop fs -ls /YOURNAME
hadoop fs -cat /YOURNAME/words | less
Download WordCount.java to your machine and compile it into wordcount.jar as follows:
mkdir wordcount_classes
javac -classpath ${HADOOP_HOME}/hadoop-*-core.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes .
To perform a Map-Reduce job, run hadoop with the jar option and specify the input file and a new directory for output files:
hadoop jar wordcount.jar WordCount /public/warandpeace.txt /YOURNAME/outputs
Now, your outputs are stored under /YOURNAME/outputs in Hadoop:
hadoop fs -ls /YOURNAME/outputs
hadoop fs -cat /YOURNAME/outputs/part-00000
Changes from Hadoop 0.18 to 0.20 http://blog.data-miners.com/
The updated code with the Hadoop 0.20 API is in RowNumberTwoPass-0.20.java.
Before 0.20, Hadoop used classes in a package called "mapred". Starting with 0.20, it uses classes in "mapreduce", which have a different interface. The reason for this change has to do with future development of Hadoop: it will make it possible to separate releases of HDFS (the distributed file system) from releases of MapReduce. The following packages contain the new interface:
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.map.*;
import org.apache.hadoop.mapreduce.lib.reduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
The following changes apply to the Map and Reduce classes:
The classes no longer need the "implements" syntax.
The function called before the map/reduce is now called setup() rather than configure().
The function called after the map/reduce is called cleanup().
The functions all take an argument whose class is Context; this is used instead of Reporter and OutputCollector.
The map and reduce functions can also throw InterruptedException.
The driver function has more changes, caused by the fact that JobConf is no longer part of the interface. Instead, the work is set up using Job. Variables and values are passed into the Map and Reduce classes through the Configuration (obtained from the Context) rather than JobConf, and the code for the Map and Reduce classes is added in using the call Job.setJarByClass().
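A hedged sketch of a mapper written against the 0.20 "mapreduce" API, illustrating the points above (the class itself is hypothetical); it would be registered with job.setMapperClass(TokenCountMapper.class) and packaged via job.setJarByClass(...):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void setup(Context context) {               // replaces configure()
        // read job parameters from context.getConfiguration() if needed
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {    // note InterruptedException
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                token.set(word);
                context.write(token, one);                 // Context replaces OutputCollector
            }
        }
    }

    @Override
    protected void cleanup(Context context) {              // called once after the last record
    }
}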
Implement the Hadoop Tool interface
http://grepalex.com/2013/02/25/hadoop-libjars/
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html
* -D to pass in arbitrary Hadoop job properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7; note the space after -D)
* -files to put files into the distributed cache
* -archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
* -libjars to put JAR files on the task classpath
public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), MyJob.class);
        // run job ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
http://cxwangyi.blogspot.com/2009/12/wordcount-tutorial-for-hadoop-0201.html
Hadoop C++
http://wiki.apache.org/hadoop/C%2B%2BWordCount
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop
http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
Makefile
CC = g++
HADOOP_INSTALL=/home/y/libexec/hadoop
#PLATFORM = Linux-amd64-64
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
wordcount2: wordcount.cpp
$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
-lhadooputils -lpthread -g -O2 -o $@
debugging hadoop: http://habrahabr.ru/blogs/hi/89365/#habracut
The NameNode stores all information about the file system namespace in a file called FsImage. This file, along with a record of all transactions (referred to as the EditLog), is stored on the local file system of the NameNode. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system itself. The NameNode relies on periodic heartbeat messages from each DataNode to confirm that the node is alive and that its block replicas are available.
http://blip.tv/search?q=hadoop
Sequence File
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
cascading.org
http://www.cascading.org/ http://www.karmasphere.com/
http://gigaom.com/cloud/twitter-to-open-source-hadoop-like-tool/ Real-Time Stream Processing
Lucene and Solr
Solr is a high performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
http://lucene.apache.org/
High-Scale Architectures
http://www.royans.net/arch/library/
http://www.jiahenglu.net/course/advancedDataManagement/
http://horicky.blogspot.com/2010/10/scalable-system-design-patterns.html
http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html
Pregel: A System for Large-Scale Graph Processing
http://www.royans.net/arch/pregel-googles-other-data-processing-infrastructure/