MapReduce Algorithm
Business problems that Big Data is solving today
- Extracting attributes from unstructured data such as images to help consumers find the right match during searches, e.g. "find red t-shirt"
- How to split marketing expenditure across channels: web, iOS, Android, Facebook, Twitter, etc.
- How much inventory will be obsolete next month in each location?
- What products are on backorder?
- How will this large order affect current inventory levels?
- Detect fraudulent supplier websites
Some common use cases of MapReduce
- Query log processing
- Crawling, indexing, and search
- Analytics, text processing, and sentiment analysis
- Machine learning (such as Markov chains and the Naive Bayes classifier)
- Recommendation systems
- Document clustering and classification
- Bioinformatics (alignment, re-calibration, germline ingestion, and DNA/RNA sequencing)
- Genome analysis (biomarker analysis, and regression algorithms such as linear and Cox)
When MapReduce is suitable for computation
- When you have to handle lots of input data (e.g., aggregate or compute statistics over large amounts of data).
- When you need to take advantage of parallel and distributed computing, data storage, and data locality.
- When you can do many tasks independently without synchronization.
- When you can take advantage of sorting and shuffling.
- When you need fault tolerance and you cannot afford job failures.
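
The conditions above can be illustrated with a minimal single-process sketch of the map, shuffle/sort, and reduce phases, using word counting as the task. This is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. Each call to map_phase is independent of the others, which is the property that lets a real framework run the calls in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and sort by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Combine all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Every line is mapped independently; no shared state is needed.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big", "data locality"]))
```

In a distributed setting the lines would live on different nodes and the shuffle would move data across the network; here everything runs in one process purely to show the data flow.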
Scenarios where MapReduce should not be used:
- If the computation of a value depends on previously computed values. A good example is the Fibonacci series, where each value is the sum of the previous two values:
F(k + 2) = F(k + 1) + F(k)
- If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.
- If synchronization is required to access shared data.
- If all of your input data fits in memory.
- If one operation depends on other operations.
- If basic computations are processor-intensive.
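
Two of the cases above can be sketched in plain Python. The first function shows why the Fibonacci series resists parallel mapping: each step needs the two results computed just before it. The second shows the single-machine reduce(map(data)) form suggested for small data sets. The function names and the word-length task are illustrative assumptions, not taken from the source.

```python
from functools import reduce


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from F(0) = 0.

    Each iteration depends on the two previous values, so the loop
    cannot be split into independent map tasks.
    """
    values = []
    a, b = 0, 1
    for _ in range(n):
        values.append(a)
        a, b = b, a + b
    return values


def total_length(words):
    """Small data on one machine: a single reduce(map(data)) call."""
    return reduce(lambda acc, length: acc + length, map(len, words), 0)


if __name__ == "__main__":
    print(fibonacci(8))
    print(total_length(["map", "reduce"]))
```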
Basic MapReduce Patterns
- Distributed Task Execution
- Counting and summing, as in log analysis and data querying
- Collating, as in building an inverted index and ETL
- Filtering, parsing, and validating, as in log analysis, data querying, ETL, and data validation
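
As a rough illustration of the filtering and collating patterns, the sketch below builds an inverted index in plain Python. The mapper parses each document and filters out tokens that are not purely alphabetic; the collating step gathers, for each term, the documents that contain it. The names map_document and build_inverted_index, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Parse and filter one document, emitting (term, doc_id) pairs."""
    for token in text.lower().split():
        term = token.strip(".,;:!?\"'")
        if term.isalpha():  # filtering/validation: keep alphabetic terms only
            yield term, doc_id


def build_inverted_index(documents):
    """Collate the mapper output into term -> sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            index[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in index.items()}


if __name__ == "__main__":
    docs = {1: "Red shirt, size 42", 2: "Blue shirt"}
    print(build_inverted_index(docs))
```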
Relational MapReduce Patterns
- Selection or filtering
- Group-By and Aggregation
- Join - map join and hash join
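
The relational patterns can be sketched on in-memory lists of dictionaries. This is a single-process illustration, not a distributed implementation; the function names and the sample orders/products records are assumptions made for the example. The join shown is a hash join of the kind commonly used for map-side joins: the smaller input is loaded into a hash table and the larger input is streamed past it.

```python
from collections import defaultdict


def select(records, predicate):
    """Selection: keep only the records that satisfy the predicate."""
    return [record for record in records if predicate(record)]


def group_by_sum(records, key, value):
    """Group-by and aggregation: sum `value` for each distinct `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


def hash_join(small, large, key):
    """Inner join: build a hash table on the small side, probe with the large side."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    joined = []
    for row in large:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    products = [{"sku": "A", "name": "shirt"}, {"sku": "B", "name": "hat"}]
    orders = [{"sku": "A", "qty": 2}, {"sku": "A", "qty": 3},
              {"sku": "B", "qty": 1}, {"sku": "C", "qty": 5}]
    print(select(orders, lambda r: r["qty"] > 1))
    print(group_by_sum(orders, "sku", "qty"))
    print(hash_join(products, orders, "sku"))
```

The hash join only works when the smaller input fits in the memory of each mapper; when neither side fits, a join performed on the reduce side after shuffling both inputs by key is the usual alternative.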
The book "Data Algorithms: Recipes for Scaling Up with Hadoop and Spark" highlights the following: