MapReduce Algorithm

Business problems that Big Data is solving today

• Extracting attributes from unstructured data such as images to help consumers find the right match during searches, e.g. "find red t-shirt"
• Deciding how to split marketing expenditure across channels: web, iOS, Android, Facebook, Twitter, etc.
• How much inventory will be obsolete next month in each location?
• What products are on backorder?
• How will this large order affect current inventory levels?
• Detecting fraudulent supplier websites

Some common use cases of MapReduce

• Query log processing
• Crawling, indexing, and search
• Analytics, text processing, and sentiment analysis
• Machine learning (such as Markov chains and the Naive Bayes classifier)
• Recommendation systems
• Document clustering and classification
• Bioinformatics (alignment, re-calibration, germline ingestion, and DNA/RNA sequencing)
• Genome analysis (biomarker analysis, and regression algorithms such as linear and Cox)

When MapReduce is suitable for computation

• When you have to handle lots of input data (e.g., aggregate or compute statistics over large amounts of data).
• When you need to take advantage of parallel and distributed computing, data storage, and data locality.
• When you can do many tasks independently without synchronization.
• When you can take advantage of sorting and shuffling.
• When you need fault tolerance and you cannot afford job failures.
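
The conditions above can be illustrated with a small in-memory sketch of the three phases: map, shuffle/sort, and reduce. This is not how Hadoop or any real framework is implemented; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample lines are assumptions made for illustration. The point is that each map call and each reduce call is independent, which is what allows a real cluster to run them in parallel and rerun a failed task in isolation.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (word, 1) for every word in one input line (hypothetical mapper)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Sum the counts collected for one key (hypothetical reducer)."""
    return key, sum(values)


def run_job(records):
    # Each record is mapped independently, with no shared state between calls.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    # Each key group is reduced independently of the others.
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]


lines = ["red shirt", "blue shirt", "red hat"]  # assumed sample input
word_counts = run_job(lines)
print(word_counts)
```

In a real deployment the map calls would run on the nodes that hold the input blocks, and the shuffle would move data across the network; here everything runs in one process.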

MapReduce should not be used in the following scenarios:

• If the computation of a value depends on previously computed values. One good example is the Fibonacci series, where each value is the sum of the previous two values:

F(k + 2) = F(k + 1) + F(k)

• If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.
• If synchronization is required to access shared data.
• If all of your input data fits in memory.
• If one operation depends on other operations.
• If basic computations are processor-intensive.
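
Two of the cases above can be sketched in a few lines. The first part shows the single reduce(map(data)) operation that suffices when the data fits on one machine; the second shows why the Fibonacci recurrence cannot be split into independent tasks. The sample values and the names order_totals, total_cents, and fibonacci are assumptions made for illustration.

```python
from functools import reduce

# Small data set: a plain reduce(map(data)) on one machine is enough.
order_totals = ["19.99", "5.00", "12.50"]  # assumed sample records

# map: parse each record independently into integer cents.
cents = map(lambda s: round(float(s) * 100), order_totals)
# reduce: fold the mapped values into a single total.
total_cents = reduce(lambda acc, x: acc + x, cents, 0)


def fibonacci(n):
    """Return F(n) with F(0) = 0 and F(1) = 1.

    Every iteration needs the two values produced by the previous
    iterations, so the loop is inherently sequential and there is no
    independent per-record work to hand out to mappers.
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(total_cents, fibonacci(10))
```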

Basic MapReduce Patterns

• Counting and summing, as in log analysis and data querying
• Collating, as in building an inverted index and ETL
• Filtering, parsing, and validating, as in log analysis, data querying, ETL, and data validation
• Sorting
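
A rough sketch of three of these patterns, using a tiny single-process helper in place of a real framework. The helper map_reduce, the mappers and reducers, and the sample documents and log lines are all hypothetical. Collating is shown as an inverted index; counting and summing is shown as a status-code count, with filtering and validation done inside the mapper.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Iterating keys in sorted order mimics the sort step of the shuffle.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Collating: build an inverted index (term -> documents containing it).
docs = [("d1", "red shirt"), ("d2", "blue shirt")]  # assumed sample data


def index_mapper(doc):
    doc_id, text = doc
    for term in set(text.split()):
        yield term, doc_id


def index_reducer(term, doc_ids):
    return sorted(doc_ids)


inverted_index = map_reduce(docs, index_mapper, index_reducer)

# Counting and summing, with filtering and validating in the mapper.
log_lines = ["GET /a 200", "GET /b 404", "GET /a 200", "bad line"]


def status_mapper(line):
    parts = line.split()
    # Validate: keep only well-formed lines ending in a numeric status code.
    if len(parts) == 3 and parts[2].isdigit():
        yield parts[2], 1


def count_reducer(status, ones):
    return sum(ones)


status_counts = map_reduce(log_lines, status_mapper, count_reducer)
print(inverted_index, status_counts)
```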

Relational MapReduce Patterns

• Selection or filtering
• Projection
• Union
• Intersection
• Difference
• Group-By and Aggregation
• Join - map join and hash join
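
As an illustration of how some of these relational operations map onto the model, the sketch below combines a map-side join with selection, projection, and group-by aggregation. It assumes the smaller table fits in memory and can be handed to every mapper as a hash table, which is one common reading of a map join; the tables, field names, and function names are made up for the example.

```python
from collections import defaultdict

# Small table, assumed to fit in memory on every mapper (hypothetical data).
products = {"p1": "t-shirt", "p2": "hat"}

# Large table, processed record by record (hypothetical data).
orders = [
    {"order_id": 1, "product_id": "p1", "qty": 2, "location": "NYC"},
    {"order_id": 2, "product_id": "p2", "qty": 1, "location": "NYC"},
    {"order_id": 3, "product_id": "p1", "qty": 5, "location": "SF"},
    {"order_id": 4, "product_id": "p9", "qty": 1, "location": "SF"},
]


def join_mapper(order):
    """Join each order against the in-memory product table.

    Selection: orders with no matching product are dropped.
    Projection: only the product name and the quantity are emitted.
    """
    name = products.get(order["product_id"])
    if name is not None:
        yield name, order["qty"]


# Shuffle: group the emitted quantities by product name.
groups = defaultdict(list)
for order in orders:
    for key, value in join_mapper(order):
        groups[key].append(value)

# Reduce: group-by and aggregation, summing the quantity per product.
qty_by_product = {name: sum(quantities) for name, quantities in groups.items()}
print(qty_by_product)
```

When neither table fits in memory, the join is usually moved to the reduce side instead: both tables are mapped to (join key, tagged record) pairs and matched inside the reducer.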

Reference:

• Data Algorithms: Recipes for Scaling Up with Hadoop and Spark