UNIT-I: (Wiley & Big Java 4th)
Data structures in Java: Linked List, Stacks, Queues, Sets, Maps; (Chapter 15, 16)
Generics: Generic classes and Type parameters, Implementing Generic Types, Generic Methods (Chapter 17)
Wrapper Classes (Chapter 7); Concept of Serialization (Chapter 19)
UNIT-II: Working with Big Data: Google File System, Hadoop Distributed File System (HDFS), Building Blocks of Hadoop (NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker), Introducing and Configuring a Hadoop Cluster (Local, Pseudo-distributed, and Fully Distributed modes), Configuring XML files.
UNIT-III: Writing MapReduce Programs: A Weather Dataset, Understanding the Hadoop API for the MapReduce Framework (Old and New), Basic programs of Hadoop MapReduce: Driver code, Mapper code, Reducer code, Record Reader, Combiner, Partitioner
UNIT-IV: Stream Computing and Spark: Introduction to Stream Concepts – Stream Data Model and Architecture, Stream Computing, Sampling Data in a Stream, Filtering Streams, Counting Distinct Elements in a Stream; Introduction to Spark, Spark Architecture and Components, Spark Installation, Spark RDD (Resilient Distributed Dataset) – Spark RDD Operations. Assignment-1, Assignment-2
UNIT-V: Pig: Hadoop Programming Made Easier Admiring the Pig Architecture, going with the Pig Latin Application Flow, working through the ABCs of Pig Latin, Evaluating Local and Distributed Modes of Running Pig Scripts, Checking out the Pig Script Interfaces, Scripting with Pig Latin.
Applying Structure to Hadoop Data with Hive: Saying Hello to Hive, Seeing How the Hive is Put Together, Getting Started with Apache Hive, Examining the Hive Clients, Working with Hive Data Types, Creating and Managing Databases and Tables, Seeing How the Hive Data Manipulation Language Works, Querying and Analyzing data
Textbooks: -
1. Big Java, 4th Edition, Cay Horstmann, John Wiley & Sons, Inc.
2. Hadoop: The Definitive Guide, Tom White, 3rd Edition, O'Reilly
Reference Books: -
1. Hadoop in Action by Chuck Lam, MANNING Publ.
2. Hadoop for Dummies by Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, Rafael Coss
3. Hadoop in Practice by Alex Holmes, MANNING Publ.
4. Big Data Analytics by Dr. A. Krishna Mohan and Dr. E. Laxmi Lydia
5. Hadoop MapReduce Cookbook by Srinath Perera, Thilina Gunarathne
Software Links: -
1. Hadoop: http://hadoop.apache.org/
2. Hive: https://cwiki.apache.org/confluence/display/Hive/Home
3. Pig Latin: http://pig.apache.org/docs/r0.7.0/tutorial.html
Quiz organized in the class on 06.01.2026 at 3 pm. Topic: Units 1 & 2
Winners: Batches 1, 2
Runners-up: Batches 6, 8
Software Requirements: -
Hadoop: https://hadoop.apache.org/release/2.7.6.html
Java: https://www.oracle.com/java/technologies/javase/javase8u211-later-archive-downloads.html
Eclipse: https://www.eclipse.org/downloads/
1. Implement the following Data structures in Java a) Linked Lists b) Stacks c) Queues d) Set e) Map
2. (i) Set up and install Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed. (ii) Use web-based tools to monitor your Hadoop setup.
3. Implement the following file management tasks in Hadoop: • Adding files and directories • Retrieving files • Deleting files Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the above command line utilities.
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
5. Write a MapReduce program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
6. Use Map Reduce to find the shortest path between two people in a social graph. Hint: Use an adjacency list to model a graph, and for each node store the distance from the original node, as well as a back pointer to the original node. Use the mappers to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.
Experiment 7: Week 8:
7. Implement the Friends-of-Friends algorithm in MapReduce. Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.
Experiment 8: Week 9:
8. Implement an iterative PageRank graph algorithm in MapReduce. Hint: PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank values for each node, and for re-creating the original graph with the updated PageRank values.
Experiment 9: Week 10:
9. Perform an efficient semi-join in MapReduce. Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed Cache and then filter results from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which data source records should be emitted to the reducers.
Experiment 10: Week 11:
10. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
Experiment 11: Week 12:
11. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes
Experiment 12: Week 13:
1. Implement the following Data structures in Java a) Linked Lists b) Stacks c) Queues d) Set e) Map
1. What is the Java Collections Framework?
Answer: It is a unified architecture for representing and manipulating collections. It contains interfaces (like List, Set, Map), implementations (classes like ArrayList, HashMap), and algorithms (searching and sorting).
2. Difference between an Array and a Linked List?
Answer: Arrays are fixed in size and store elements in contiguous memory locations, allowing O(1) random access. Linked Lists are dynamic in size, store elements in nodes with pointers, and require O(n) time for random access but offer easier insertion/deletion.
3. What is a 'Node' in a Linked List?
Answer: A node is the basic unit of a linked list. It contains two parts: the data (the actual value) and the reference (or pointer) to the next node in the sequence.
4. What is the difference between a Singly Linked List and a Doubly Linked List?
Answer: In a Singly Linked List, each node points only to the next node. In a Doubly Linked List, each node contains two pointers: one pointing to the next node and one pointing to the previous node, allowing traversal in both directions.
5. Explain the principle of a Stack and its primary operations.
Answer: A Stack follows the LIFO (Last In First Out) principle. Its primary operations are:
push(): Adds an element to the top.
pop(): Removes the element from the top.
peek(): Returns the top element without removing it.
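These three operations can be sketched in a few lines of Java (a minimal illustration; the class name StackDemo is ours, and ArrayDeque is used because it is the stack implementation recommended in modern Java):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StackDemo {
    public static int demo() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);        // stack (top first): 1
        stack.push(2);        // stack: 2, 1
        stack.push(3);        // stack: 3, 2, 1
        stack.pop();          // removes 3 -- last in, first out
        return stack.peek();  // 2 is now on top; peek does not remove it
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```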
6. Explain the principle of a Queue and its primary operations.
Answer: A Queue follows the FIFO (First In First Out) principle. Its primary operations are:
enqueue() (or add/offer in Java): Adds an element to the rear (tail).
dequeue() (or remove/poll in Java): Removes an element from the front (head).
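The FIFO behaviour can be shown with the Java names mentioned above (a small sketch; the class name QueueDemo is ours):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueDemo {
    public static int demo() {
        Queue<Integer> q = new ArrayDeque<>();
        q.offer(1);        // enqueue at the rear
        q.offer(2);
        q.offer(3);
        q.poll();          // dequeue removes 1 -- first in, first out
        return q.peek();   // 2 is now at the front
    }
}
```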
7. How can you implement a Stack using a Linked List in Java?
Answer: You can treat the "head" of the Linked List as the "top" of the stack. When pushing, you add a new node at the head; when popping, you remove the node from the head. This ensures O(1) time complexity.
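A minimal sketch of that idea (the class name LinkedStack is ours; both operations touch only the head, hence O(1)):

```java
public class LinkedStack<T> {
    private static class Node<T> {
        T data;
        Node<T> next;
        Node(T d, Node<T> n) { data = d; next = n; }
    }

    private Node<T> head; // head of the list doubles as the top of the stack

    public void push(T value) {            // O(1): new node becomes the head
        head = new Node<>(value, head);
    }

    public T pop() {                       // O(1): unlink the head
        if (head == null) throw new java.util.NoSuchElementException("stack is empty");
        T v = head.data;
        head = head.next;
        return v;
    }

    public boolean isEmpty() { return head == null; }
}
```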
8. What is the main characteristic of a Set in Java?
Answer: A Set is a collection that cannot contain duplicate elements. It models the mathematical set abstraction. Common implementations include HashSet (unordered) and TreeSet (sorted).
9. How does a HashMap store data?
Answer: A HashMap stores data in key-value pairs. It uses a technique called hashing to convert the key into an index in an underlying array (bucket), allowing for very fast retrieval and insertion (O(1) average case).
10. What happens if two different keys have the same hashcode in a Map?
Answer: This is called a Hash Collision. Java handles this by using a linked list or a balanced tree at that specific bucket index to store multiple entries (Chaining).
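Collisions are easy to demonstrate: the strings "Aa" and "BB" are a well-known pair with the same hashCode (2112), yet a HashMap keeps both entries via chaining (a small sketch; the class name is ours):

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static boolean demo() {
        // Distinct keys, identical hash code -> same bucket.
        Map<String, Integer> m = new HashMap<>();
        m.put("Aa", 1);
        m.put("BB", 2);
        // Both mappings survive the collision because the bucket chains entries.
        return "Aa".hashCode() == "BB".hashCode()
                && m.get("Aa") == 1
                && m.get("BB") == 2;
    }
}
```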
11. Difference between HashSet and TreeSet?
Answer: HashSet uses a hash table and is faster (O(1)) but does not maintain any order. TreeSet uses a Red-Black tree structure, is slower (O(log n)), but keeps the elements in a sorted (natural) order.
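The ordering difference can be seen directly (a minimal sketch; the class name is ours):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SetOrderDemo {
    public static List<Integer> sorted() {
        // Elements inserted out of order...
        Set<Integer> t = new TreeSet<>(Arrays.asList(5, 1, 3));
        // ...but a TreeSet always iterates in natural (sorted) order.
        return new ArrayList<>(t);
    }
}
```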
12. Which Java interface should you use to implement a Queue?
Answer: You should use the java.util.Queue interface. It is typically instantiated using LinkedList or ArrayDeque (e.g., Queue<Integer> q = new LinkedList<>();).
13. Why is it better to use Deque instead of Stack class in Java?
Answer: The Stack class in Java is considered legacy and is synchronized (slower). The Deque interface (using ArrayDeque) is faster and more consistent for LIFO operations.
14. What is the time complexity of searching for an element in a balanced Map vs. a Linked List?
Answer: Searching in a Map (HashMap) is generally O(1), whereas searching in a Linked List is O(n) because you must traverse the list sequentially.
2. (i) Set up and install Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed. (ii) Use web-based tools to monitor your Hadoop setup.
1. Name and briefly explain the three operating modes of Hadoop.
Answer:
Standalone (Local) Mode: The default mode where Hadoop runs as a single Java process using the local file system instead of HDFS. Used for debugging.
Pseudo-Distributed Mode: Runs on a single machine but simulates a cluster. Each Hadoop daemon (NameNode, DataNode, etc.) runs in a separate Java process.
Fully Distributed Mode: The production mode where Hadoop runs on a cluster of multiple machines (Master and Slaves).
2. What is the main difference between Standalone and Pseudo-distributed mode?
Answer: Standalone mode uses the Local File System and a single JVM, whereas Pseudo-distributed mode uses HDFS and separate JVMs for each daemon, even though they are all on the same physical machine.
3. Which software prerequisites are mandatory for installing Hadoop?
Answer:
Java Development Kit (JDK): Hadoop is Java-based.
SSH (Secure Shell): Required for Hadoop scripts to manage remote (or local) daemons without manual password entry.
4. What is the significance of the core-site.xml file?
Answer: It contains configuration settings for Hadoop Core, such as the I/O settings and the default filesystem (HDFS) URL (e.g., hdfs://localhost:9000).
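As a sketch of what this looks like, a minimal core-site.xml for pseudo-distributed mode might contain the following (assuming Hadoop 2.x or later, where the property is named fs.defaultFS; older releases used fs.default.name):

```xml
<configuration>
  <!-- Default filesystem URI: points clients at the local NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```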
5. Why do we need to format the NameNode before starting the cluster for the first time?
Answer: Formatting the NameNode (hdfs namenode -format) initializes the directory structure for the HDFS metadata (Namespace and Edit Logs). It creates a new, empty file system. Caution: Formatting an existing NameNode will delete all data in the HDFS.
6. Which command is used to start all HDFS and YARN daemons at once?
Answer: While individual scripts exist, start-all.sh (or a combination of start-dfs.sh and start-yarn.sh) is commonly used to launch all required processes.
7. What are the five primary daemons in a standard Hadoop 2.x/3.x setup?
Answer:
NameNode: Master of HDFS (stores metadata).
DataNode: Slave of HDFS (stores actual data).
Secondary NameNode: Performs checkpoints of the metadata.
ResourceManager: Master of YARN (allocates resources).
NodeManager: Slave of YARN (executes tasks on the node).
8. How do you verify which Hadoop daemons are currently running on your system?
Answer: By using the jps (Java Virtual Machine Process Status Tool) command in the terminal.
9. How can you monitor the health of the HDFS via a web browser?
Answer: By accessing the NameNode Web UI.
For Hadoop 3.x, the default port is 9870 (e.g., http://localhost:9870).
For Hadoop 2.x, the default port was 50070.
10. Which tool is used to monitor running jobs and cluster resources?
Answer: The YARN ResourceManager Web UI. It provides details on application status, node health, and resource allocation. The default port is 8088 (e.g., http://localhost:8088).
11. What information can you find in the NameNode Web UI?
Answer: You can view the HDFS capacity, the number of live/dead DataNodes, block information, and browse the HDFS file directory.
12. What is SSH Passphraseless login, and why is it required?
Answer: It allows the master node to log into slave nodes (or itself) without being prompted for a password. It is essential for Hadoop scripts to start, stop, and manage daemons across the cluster automatically.
13. In a Fully Distributed mode, where is the data actually stored?
Answer: The actual data is stored on the DataNodes. The NameNode only stores the "map" or "index" (metadata) that tells Hadoop where each piece of data is located across the DataNodes.
3. Implement the following file management tasks in Hadoop: • Adding files and directories • Retrieving files • Deleting files Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the above command line utilities.
1. What does HDFS stand for and what is its primary purpose?
Answer: HDFS stands for Hadoop Distributed File System. Its purpose is to store very large data sets (Terabytes to Petabytes) reliably across a cluster of commodity hardware and to provide high-throughput access to application data.
2. Explain the "Write Once, Read Many" model of HDFS.
Answer: HDFS is designed for batch processing rather than interactive use. A file once created, written, and closed need not be changed. This simplifies data coherency issues and enables high-throughput data access.
3. How do you create a new directory in HDFS?
Answer: Use the -mkdir command.
Example: hdfs dfs -mkdir /user/data
4. What is the difference between the put and copyFromLocal commands?
Answer:
copyFromLocal: Specifically used to copy files from the local file system to HDFS.
put: A more general command that can copy from the local file system or standard input (stdin) to HDFS.
In most basic scenarios, they perform the same function.
5. How do you retrieve a file from HDFS back to your local machine?
Answer: Use the -get or -copyToLocal command.
Example: hdfs dfs -get /user/data/logs.txt /home/user/downloads
6. Which command is used to list the contents of a directory in HDFS?
Answer: Use the -ls command.
Example: hdfs dfs -ls /user
7. How do you delete a file and a directory in HDFS?
Answer:
To delete a file: hdfs dfs -rm /path/to/file
To delete a non-empty directory: hdfs dfs -rm -r /path/to/dir (The -r flag stands for recursive).
8. Explain the internal process when you "Add/Upload" a file to HDFS.
Answer:
1. The Client asks the NameNode for permission to create the file.
2. The NameNode checks if the file exists and gives the client the addresses of DataNodes where blocks should be stored.
3. The Client splits the file into blocks (default 128MB) and writes them directly to the DataNodes.
4. DataNodes replicate the blocks to other DataNodes based on the replication factor.
9. What happens to a deleted file in Hadoop? Is it gone forever immediately?
Answer: If the Trash feature is enabled, the file is moved to a .Trash folder in the user's home directory in HDFS. It stays there for a configured interval before being permanently deleted. If Trash is not enabled, it is deleted immediately.
10. What is the default block size in Hadoop 2.x/3.x, and why is it so large?
Answer: The default is 128 MB. It is large to minimize the cost of "seeks" (the time taken to find the data on the disk). By making the block large, the time to transfer the data is significantly longer than the time to find the start of the block.
11. How do you check the disk usage of a specific directory in HDFS?
Answer: Use the -du command.
Example: hdfs dfs -du -h /user/data (The -h makes the output human-readable, e.g., in MB or GB).
12. Can you append data to an existing file in HDFS?
Answer: Yes, using the -appendToFile command. However, HDFS does not support random updates (editing the middle of a file).
Example: hdfs dfs -appendToFile local_file.txt /hdfs/target_file.txt
13. What command would you use to see the last few lines of a large log file stored in HDFS?
Answer: Use the -tail command.
Example: hdfs dfs -tail /user/logs/system.log
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
1. What is MapReduce?
Answer: MapReduce is a software framework and programming model used for processing vast amounts of data in parallel across a Hadoop cluster. It divides the task into two main phases: Map (filtering and sorting) and Reduce (summarizing).
2. Explain the "Divide and Conquer" strategy in MapReduce.
Answer: The framework splits the input data into smaller independent chunks (splits) which are processed by Map tasks in a completely parallel manner. The framework then sorts the outputs of the maps, which are then input to the Reduce tasks.
3. Describe the specific role of the Mapper in a Word Count program.
Answer: The Mapper takes a line of text as input, tokenizes it into individual words, and emits a key-value pair where the key is the word and the value is 1.
Input: "Apple Banana Apple"
Output: <Apple, 1>, <Banana, 1>, <Apple, 1>
4. What happens during the "Shuffle and Sort" phase?
Answer: This is the intermediate phase between Map and Reduce. The framework groups all values associated with the same key together and sorts them.
Input to Reducer: <Apple, [1, 1]>, <Banana, [1]>
5. Describe the role of the Reducer in Word Count.
Answer: The Reducer receives the grouped keys and their list of values. It iterates through the list, sums the values (the 1s), and emits a final key-value pair containing the word and its total count.
Final Output: Apple 2, Banana 1
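The Map, Shuffle/Sort, and Reduce phases described above can be simulated in plain Java without the Hadoop API (an in-memory sketch of the paradigm, not Hadoop code; the class name WordCountSim is ours):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSim {
    // Map phase: emit <word, 1> for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Shuffle/Sort + Reduce: group by key (TreeMap keeps keys sorted)
    // and sum the 1s for each word.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static Map<String, Integer> run(String line) {
        return shuffleAndReduce(map(line));
    }

    public static void main(String[] args) {
        System.out.println(run("Apple Banana Apple")); // prints {Apple=2, Banana=1}
    }
}
```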
6. Which Java classes/interfaces are commonly extended to write a MapReduce program?
Answer: You typically extend the Mapper class and the Reducer class provided by the org.apache.hadoop.mapreduce package.
7. Why does Hadoop use Text and IntWritable instead of standard Java String and Integer?
Answer: Hadoop uses specialized Writable types for network efficiency. These classes are optimized for serialization (converting objects to a stream of bytes), which is necessary because data must be sent across different nodes in the cluster.
8. What is the purpose of the 'Driver' class in your code?
Answer: The Driver class configures the MapReduce job. It specifies the Mapper and Reducer classes, the input/output paths, and the data types for keys and values. It acts as the "entry point" that submits the job to the cluster.
9. What is a Combiner, and how does it help the Word Count program?
Answer: A Combiner is a "Mini-Reducer" that runs on the Map side. It aggregates the word counts locally at the Mapper level (e.g., changing <Apple, 1>, <Apple, 1> to <Apple, 2>) before sending data over the network, which significantly reduces network congestion.
10. What is an InputSplit in MapReduce?
Answer: An InputSplit is a logical chunk of data that is processed by a single Mapper. Usually, one InputSplit corresponds to one HDFS block (e.g., 128MB).
11. What happens if a Map task fails during the Word Count execution?
Answer: The ResourceManager (specifically the ApplicationMaster) detects the failure and automatically reschedules the task on another healthy node. This provides fault tolerance.
5. Write a MapReduce program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
1. Why is weather data considered an ideal candidate for MapReduce analysis?
Answer: Weather data is typically semi-structured (consistent format like CSV or fixed-width) and record-oriented. Since each sensor reading is independent of others, the data can be split into chunks and processed in parallel across many nodes without requiring communication between them.
2. In the "Max Temperature" problem, what are the Input and Output Key-Value pairs?
Answer:
Input (to Mapper): Key: Line Offset (LongWritable); Value: the entire line of log data (Text).
Intermediate (Mapper Output): Key: Year (Text); Value: Temperature (IntWritable).
Final (Reducer Output): Key: Year (Text); Value: Maximum Temperature (IntWritable).
3. What is the primary responsibility of the Mapper in this experiment?
Answer: The Mapper's job is Data Extraction and Cleaning. It parses each line to extract the year and temperature. It must also filter out corrupted or missing data (often represented by specific codes like +9999) to ensure the analysis is accurate.
4. How does the Reducer find the maximum temperature for a specific year?
Answer: The Reducer receives a key (Year) and an iterator containing all temperatures for that year. It initializes a variable (e.g., max_temp) to a very low value and iterates through the list, updating the variable whenever it finds a higher temperature.
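The reduce step is simply a running maximum over the shuffled values; a plain-Java sketch of just that logic (MaxTempReducerSim is an illustrative name, not part of the Hadoop API):

```java
import java.util.List;

public class MaxTempReducerSim {
    // All temperatures shuffled to one key (a year) arrive as a list;
    // keep the largest, starting from the smallest possible int.
    public static int reduce(List<Integer> temps) {
        int max = Integer.MIN_VALUE;
        for (int t : temps) max = Math.max(max, t);
        return max;
    }
}
```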
5. What happens during the "Shuffle and Sort" phase in this weather program?
Answer: The framework takes all the <Year, Temp> pairs from all Mappers, sorts them by Year, and ensures that all temperatures for "1950" are sent to the same Reducer, while all temperatures for "1951" go to another (or the same one in a sequence).
6. How would the program change if you were asked to find the Average temperature instead of the Maximum?
Answer: The Mapper would remain exactly the same. However, the Reducer would change: it would need to maintain a running sum and a count of all temperatures for that year. Finally, it would divide the sum by the count before emitting the result.
7. Can a Combiner be used in this weather data experiment? Why?
Answer: Yes. Since the MAX() function is associative and commutative, a Combiner can find the local maximum temperature at each Mapper node. This significantly reduces the amount of data sent over the network to the Reducer, improving performance.
8. What is the NCDC format often used in these experiments?
Answer: It is a fixed-width format provided by the National Climatic Data Center. Because the fields (year, temp, quality) are always at the same character positions, we can use String.substring() in Java to extract them efficiently.
9. How does the program handle a scenario where a temperature is recorded as "+0025"?
Answer: In the Java code, we typically handle the sign character separately or use Integer.parseInt() on the substring starting from the sign to correctly convert the string into a signed integer.
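A small check of that conversion (assuming Java 7 or later, where Integer.parseInt accepts a leading '+' sign; the class name is ours):

```java
public class TempParseDemo {
    // Converts a signed fixed-width field such as "+0025" or "-0011"
    // directly to a signed int; leading zeros are ignored.
    public static int parse(String field) {
        return Integer.parseInt(field);
    }
}
```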
10. What is 'Data Locality' in the context of this experiment?
Answer: Hadoop tries to run the Mapper tasks on the same nodes where the weather data blocks are stored in HDFS. This minimizes moving large amounts of raw log data across the network.
6. Use Map Reduce to find the shortest path between two people in a social graph. Hint: Use an adjacency list to model a graph, and for each node store the distance from the original node, as well as a back pointer to the original node. Use the mappers to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.
1. How do you represent a social graph for a MapReduce job?
Answer: We use an Adjacency List. Each record (node) contains:
The Node ID.
A list of its neighbors (edges).
The current distance from the source node (initialized to infinity, except for the source which is 0).
A "back-pointer" or parent node to reconstruct the path later.
2. Why is this called an "Iterative" MapReduce process?
Answer: Unlike Word Count, which runs once, finding the shortest path requires multiple MapReduce passes. Each pass (iteration) explores the graph one "hop" further from the source. We repeat the cycle until the target node is reached or no more distances can be updated.
3. What is the primary role of the Mapper in this BFS implementation?
Answer: The Mapper has two tasks:
Pass-along the graph structure: It emits the node's adjacency list so the graph isn't lost.
Propagate distance: For each neighbor of a "reachable" node, it emits a new temporary node object with a distance of d + 1. This effectively "signals" to the neighbors how far they are from the source.
4. What does the Reducer do when it receives multiple distances for the same node?
Answer: The Reducer receives all messages for a specific Node ID. It compares all incoming distances and selects the minimum value. It then reconstructs the node with this new minimum distance and the original adjacency list to be used in the next iteration.
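The minimum-selection at the heart of this reducer can be sketched in plain Java (illustrative class name; a real implementation would also carry the adjacency list and back-pointer through to the next iteration):

```java
import java.util.List;

public class MinDistReducerSim {
    static final int INF = Integer.MAX_VALUE; // "not yet reached"

    // Among all candidate distances shuffled to one node, keep the smallest.
    // Nodes no mapper could reach stay at INF.
    public static int reduce(List<Integer> candidates) {
        int best = INF;
        for (int d : candidates) best = Math.min(best, d);
        return best;
    }
}
```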
5. How do you know when to stop the iterations?
Answer: There are two main termination conditions:
Target Reached: If the distance to the specific "Target Person" is no longer infinity.
Convergence: When a full iteration completes and no node's distance has changed, meaning the entire reachable graph has been explored.
6. What is the "Back-pointer" used for?
Answer: The back-pointer stores the ID of the node that first reached the current node with the shortest distance. Once the target is found, we follow these pointers backward to reconstruct the actual path (e.g., Person A -> Person B -> Person C).
7. What is the biggest performance bottleneck in this MapReduce experiment?
Answer: The overhead of starting a new MapReduce job for every iteration. Each step requires writing data to HDFS and reading it back, which involves significant I/O and network latency.
8. How does the "Shuffle" phase help in this algorithm?
Answer: The shuffle phase ensures that all "signals" (distances) sent to a specific person in the social graph arrive at the same Reducer, allowing that Reducer to pick the shortest one among all possibilities.
9. Can we use a Combiner here?
Answer: Yes. A Combiner can be used to find the local minimum distance for a node at the Mapper side before sending it over the network, reducing the total number of records the Reducer has to process.
10. What happens if the graph is not connected?
Answer: The algorithm will only find paths to nodes in the same "component" as the source. For nodes in other components, the distance will remain at infinity ($\infty$), and the algorithm will stop once all reachable nodes are visited.
Experiment 7: Week 8:
7. Implement the Friends-of-Friends algorithm in MapReduce. Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.
The Friends-of-Friends (FoF) algorithm is the core logic behind social media recommendation systems (like "People You May Know"). Implementing this in MapReduce requires a two-step pipeline because you first need to identify candidates and then rank them.
Here are the viva-voce questions and answers for this experiment.
1. What is the fundamental goal of the Friends-of-Friends (FoF) algorithm?
Answer: The goal is to suggest new connections to a user by identifying people who are not currently their friends but share one or more mutual friends. The more mutual friends two people share, the higher the recommendation priority.
2. Why are two MapReduce jobs required for this experiment?
Answer:
Job 1 (Discovery): Identifies all pairs of people who share at least one mutual friend.
Job 2 (Aggregating & Sorting): Counts how many mutual friends each pair has and sorts the suggestions in descending order of that count.
3. How does the Mapper in Job 1 process a user's adjacency list?
Answer: If User A is friends with B and C, the Mapper emits a "Mutual Friend" path for every combination of their friends. For the list A: B, C, it emits <(B, C), A>. This tells the system that B and C share A as a common friend.
Crucial Step: It must also emit the existing friendship (e.g., <(A, B), -1>) so the Reducer knows not to recommend people who are already friends.
4. What is the output of the Reducer in Job 1?
Answer: The Reducer receives a pair of users as the key and a list of their mutual friends as the value. It filters out pairs that are already friends and outputs: <UserPair, List of Mutual Friends>.
5. What is the Input and Output of the Mapper in Job 2?
Answer:
Input: The output of Job 1 (e.g., <(B, C), {A, D, E}>).
Output: It reshapes the data so it can be sorted by User ID. It emits <User B, (User C, Count: 3)> and <User C, (User B, Count: 3)>.
6. How does the Reducer in Job 2 finalize the recommendations?
Answer: For each user, the Reducer collects all candidate FoFs and their mutual friend counts. It sorts these candidates by the count (highest first) and emits the top recommendations for that specific user.
7. How do you handle the "Already Friends" problem in MapReduce?
Answer: During Job 1, the Mapper emits a special flag (like -1 or a boolean) for direct friends. If the Reducer sees this flag in the list of values for a pair <B, C>, it ignores that pair entirely because they don't need a recommendation—they are already connected.
8. What is the "Square-Law" problem in FoF MapReduce?
Answer: If a user has $N$ friends, the Mapper emits on the order of $N^2$ candidate pairs. For "celebrity" nodes with thousands of friends, this creates a massive explosion of intermediate data (data skew), which can slow down the Shuffle phase and overwhelm the Reducers.
9. Can a Combiner be used in Job 2?
Answer: No, a standard Combiner cannot be used for sorting. However, an In-Memory Combiner or a custom partitioner can sometimes be used to manage the counts before they reach the Reducer to save bandwidth.
10. How would you represent the "User Pair" as a Key to ensure (A, B) and (B, A) go to the same Reducer?
Answer: You must sort the User IDs alphabetically when creating the key. Regardless of whether the input is A-B or B-A, the key is always emitted as (Min(ID1, ID2), Max(ID1, ID2)). This ensures all mutual friends for that specific pair are aggregated together.
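A minimal sketch of such a canonical pair key (class and method names are ours):

```java
public class PairKeyDemo {
    // Canonical key for an unordered user pair: the lexicographically
    // smaller ID always comes first, so (A,B) and (B,A) produce the
    // same key and shuffle to the same reducer.
    public static String key(String u1, String u2) {
        return u1.compareTo(u2) <= 0 ? u1 + "," + u2 : u2 + "," + u1;
    }
}
```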
Experiment 8: Week 9:
8. Implement an iterative PageRank graph algorithm in MapReduce. Hint: PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank values for each node, and for re-creating the original graph with the updated PageRank values.
1. What is the fundamental idea behind the PageRank algorithm?
Answer: PageRank measures the importance of a node (web page) based on the quantity and quality of links pointing to it. A node has a high rank if many other nodes point to it, or if a few highly-ranked nodes point to it. It treats a link from Page A to Page B as a "vote" of confidence.
2. Why is PageRank an "Iterative" algorithm?
Answer: The rank of a page depends on the ranks of its neighbors, which in turn depend on their neighbors. Because these values are interdependent, we start with an initial estimate (usually $1/N$) and run multiple MapReduce passes to refine the values until they stabilize (converge).
3. How is the graph data structured for the Mapper?
Answer: Each input record represents a node and contains:
Node ID
Current PageRank (PR)
Adjacency List (the list of out-links/neighbors).
4. What are the two types of output the Mapper produces?
Answer:
1. Rank Contributions: For a node with rank $PR$ and $n$ neighbors, the mapper emits <NeighborID, PR/n> for each neighbor. This "distributes" the current node's rank among its links.
2. Graph Structure: It emits the original adjacency list <NodeID, AdjacencyList> so the reducer can reconstruct the graph for the next iteration.
5. What is the formula used by the Reducer to calculate the new PageRank?
Answer: The Reducer sums all the rank contributions ($S$) it received for a Node ID and applies the Damping Factor ($d$, usually 0.85):
$NewPR = (1 - d) + d \times S$
This formula accounts for the "Random Surfer Model," where a user might stop clicking links and jump to a random page.
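The reducer-side arithmetic above can be sketched in plain Java (no Hadoop classes), together with the threshold test used to decide convergence between iterations. Method names and the fixed damping constant are illustrative choices, not a prescribed implementation.

```java
// Sketch: NewPR = (1 - d) + d * S, where S is the sum of rank
// contributions received for one node, with d = 0.85.
class PageRankUpdate {
    static final double D = 0.85; // damping factor

    /** Sums the contributions for one node and applies damping. */
    public static double newRank(double[] contributions) {
        double s = 0.0;
        for (double c : contributions) {
            s += c;
        }
        return (1 - D) + D * s;
    }

    /** True when every node's rank moved by less than eps since the last pass. */
    public static boolean converged(double[] oldPR, double[] newPR, double eps) {
        for (int i = 0; i < oldPR.length; i++) {
            if (Math.abs(oldPR[i] - newPR[i]) >= eps) {
                return false;
            }
        }
        return true;
    }
}
```

As a sanity check: a node receiving contributions of 0.5 and 0.5 gets NewPR = 0.15 + 0.85 × 1.0 = 1.0.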
6. How do you determine when the PageRank algorithm has "Converged"?
Answer: Convergence is reached when the PageRank values of all nodes change by less than a predefined threshold (e.g., 0.001) between two consecutive iterations. In many classroom settings, a fixed number of iterations (e.g., 10 or 20) is used.
7. What are "Dangling Nodes" and how do they affect the algorithm?
Answer: Dangling nodes are pages with no outgoing links. They "trap" PageRank because they receive rank but cannot pass it on. In a full implementation, the rank from dangling nodes is collected and redistributed equally among all nodes in the graph to maintain the total PageRank sum.
8. What is the significance of the Damping Factor (0.85)?
Answer: It represents the probability that a person will continue clicking links. The value $(1 - d)$ represents the $15\%$ chance that a user will jump to a completely random page. This prevents "Rank Sinks" (groups of pages that only link to each other) from absorbing all the PageRank in the system.
9. Why is the "Shuffle" phase critical for PageRank?
Answer: The Shuffle phase is responsible for collecting all the partial rank "votes" from various Mappers and delivering them to the specific Reducer responsible for that Node ID. Without this, the Reducer couldn't calculate the new total rank.
10. How does the scale of the graph affect the MapReduce implementation?
Answer: As the graph grows to billions of nodes (like the actual web), the amount of data transferred during the Shuffle phase becomes massive. Optimizations like Combiners can be used to sum partial PageRanks at the Mapper side to reduce network traffic.
Experiment 9: Week 10:
9. Perform an efficient semi-join in MapReduce. Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed Cache and then filter results from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which data source records should be emitted to the reducers.
1. What is a Semi-Join in the context of distributed data?
Answer: A semi-join is an optimization where you only send records from the large "Main Table" to the Reducer if they have a matching key in the "Reference Table." Unlike a full join, the goal is to filter out non-matching rows as early as possible (at the Mapper) to save bandwidth.
2. What is a Bloom Filter?
Answer: A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can tell you if an element is definitely not in the set or possibly in the set. It never gives a "false negative" but may occasionally give a "false positive."
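The "definitely not / possibly yes" property can be demonstrated with a toy filter built on a `BitSet`. This is a deliberately simplified sketch, not Hadoop's `org.apache.hadoop.util.bloom.BloomFilter`: it uses only two weak hash functions, so its false-positive rate is far worse than a real filter's.

```java
import java.util.BitSet;

// Toy Bloom filter: added keys are always found (no false negatives),
// but unrelated keys may collide onto the same bits (false positives).
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    public ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash functions; a real filter uses several independent ones.
    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(31 * key.hashCode() + 17, size); }

    public void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    /** false => definitely absent; true => possibly present. */
    public boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }
}
```

Every key that was added is guaranteed to test positive; a key that was never added usually tests negative, but may not.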
3. How is the Distributed Cache used in this experiment?
Answer: The Distributed Cache is used to distribute the small Bloom Filter file to every worker node in the cluster. This allows the Mappers to load the filter into memory locally before they start processing the large data blocks.
4. Describe the specific role of the Mapper in this Semi-Join.
Answer: The Mapper reads the large data source record-by-record. For each record, it extracts the join key and performs a membership query against the Bloom Filter loaded from the cache.
If the filter returns False: The record is discarded immediately.
If the filter returns True: The record is emitted to the Reducer.
5. Why is this method more "efficient" than a standard Reduce-Side Join?
Answer: In a standard join, all data from both tables is sent over the network to the Reducers. In a semi-join with a Bloom Filter, a huge percentage of non-matching data is filtered out at the Map side, drastically reducing the "Shuffle" overhead, which is usually the slowest part of a MapReduce job.
6. What happens if the Bloom Filter gives a "False Positive"?
Answer: If the filter incorrectly says a key exists when it doesn't, that record will be sent to the Reducer. However, the Reducer will perform the final actual join and realize there is no match, so the final output remains correct. The only "penalty" for a false positive is a tiny bit of unnecessary network traffic.
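Why a false positive cannot corrupt the result can be shown with a small simulation. Here a deliberately lossy map-side test (matching on key length, standing in for a Bloom filter) lets one non-matching record through, and the exact reduce-side join drops it. All names are illustrative; no Hadoop classes are involved.

```java
import java.util.*;

// Sketch: map-side filtering may pass false positives, but the
// reduce-side exact join guarantees a correct final output.
class SemiJoinSketch {
    /** "Map side": keep records whose key might be in the reference set. */
    public static List<String> mapFilter(List<String> records, Set<Integer> lossyFilter) {
        List<String> out = new ArrayList<>();
        for (String key : records) {
            if (lossyFilter.contains(key.length())) {
                out.add(key); // false positives can slip through here
            }
        }
        return out;
    }

    /** "Reduce side": exact membership test, so false positives are dropped. */
    public static List<String> reduceJoin(List<String> survivors, Set<String> referenceKeys) {
        List<String> out = new ArrayList<>();
        for (String key : survivors) {
            if (referenceKeys.contains(key)) {
                out.add(key);
            }
        }
        return out;
    }
}
```

With reference keys {u1, u2}, the record "u9" survives the lossy filter (a false positive, costing only shuffle traffic) while "longkey" is discarded at the map side; the reducer then emits only the true match "u1".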
7. How do you create the Bloom Filter used in this experiment?
Answer: Typically, a separate, small MapReduce job (or a local script) reads the small "Reference Table," inserts all its join keys into a Bloom Filter bit-array, and serializes that array into a file to be placed in the Distributed Cache.
8. When should you not use this Bloom Filter Semi-Join approach?
Answer: If the "Reference Table" is so large that the resulting Bloom Filter cannot fit into the RAM of the Mapper nodes, or if the join keys have very high collision rates, the effectiveness of this optimization decreases.
9. Which Hadoop class provides the Bloom Filter implementation?
Answer: Hadoop provides the org.apache.hadoop.util.bloom.BloomFilter class, which includes methods like add(Key key) for creation and membershipTest(Key key) for querying.
10. How do you access the Distributed Cache inside the Mapper's code?
Answer: You typically override the setup() method of the Mapper. Inside setup(), you use the context.getCacheFiles() method to retrieve the URI of the Bloom Filter file and load it into an object.
11. Does the Reducer's logic change in a Semi-Join?
Answer: Generally, no. The Reducer still performs the actual join logic (e.g., merging values from Table A and Table B). The Semi-Join is primarily a "Map-side optimization" to ensure the Reducer only receives data that has a high probability of being useful.
Experiment 10: Week 11:
10. Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data.
Experiment 11: Week 12:
11. Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
1. What is Apache Hive, and how does it differ from Pig or MapReduce?
Answer: Hive is a data warehouse infrastructure that provides an SQL-like interface (HiveQL) to query data stored in HDFS.
MapReduce requires writing Java code.
Pig uses a data flow language (Pig Latin).
Hive is designed for analysts comfortable with SQL and is best for structured data.
2. What is the "Metastore" in Hive?
Answer: The Metastore is a central repository that stores the metadata of Hive tables (e.g., table names, column names, data types, and HDFS mapping). It typically uses a relational database like MySQL or Derby to store this information.
3. What is the difference between a Managed (Internal) Table and an External Table?
Answer: Managed Table: Hive manages both the metadata and the actual data. If you DROP the table, both the metadata and the data in HDFS are deleted.
External Table: Hive only manages the metadata. If you DROP the table, only the metadata is removed; the actual data remains intact in HDFS.
4. How do you "Alter" a table to add a new column?
Answer: You use the ALTER TABLE command.
Example: ALTER TABLE employee ADD COLUMNS (dept STRING);
5. What is the purpose of "Partitions" in Hive?
Answer: Partitioning is a way of organizing tables into horizontal slices based on the values of a specific column (like date or country). This significantly improves query performance by allowing Hive to skip unrelated data folders (Partition Pruning).
6. What is a "View" in Hive, and does it store data?
Answer: A View is a logical construct (a saved query). It does not store data physically. When you query a view, Hive executes the underlying query against the base tables. It is useful for simplifying complex queries and for security.
7. What are UDF, UDAF, and UDTF?
Answer: These are types of functions in Hive:
UDF (User Defined Function): Takes one row as input and gives one row as output (e.g., UPPER()).
UDAF (User Defined Aggregate Function): Takes multiple rows and gives one result (e.g., SUM(), COUNT()).
UDTF (User Defined Table-Generating Function): Takes one row and produces multiple rows (e.g., EXPLODE()).
8. Why would you create an "Index" in Hive?
Answer: Indexes are created to speed up the search and query performance on specific columns. However, in modern Hive (versions 3.x+), indexes have largely been replaced by Columnar formats (like ORC or Parquet) and Materialized Views.
9. Which command is used to see the structure of a table including its columns and data types?
Answer: The DESCRIBE command.
Example: DESCRIBE employee; (or DESCRIBE FORMATTED employee; for more details).
10. How do you load data from a local file into a Hive table?
Answer: Using the LOAD DATA command.
Example: LOAD DATA LOCAL INPATH '/home/user/data.txt' INTO TABLE employee;
11. What is the default port for the Hive Web UI (HiveServer2)?
Answer: The default port is 10002.
12. What happens "behind the scenes" when you execute a HiveQL query?
Answer: The Hive Driver sends the query to a Compiler, which talks to the Metastore. The compiler creates an execution plan which is converted into a series of MapReduce, Tez, or Spark jobs to be executed on the Hadoop cluster.
Experiment 12: Week 13:
A) Implement the following Data structures in Java: 🡪 Linked Lists
B) Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
A) Implement the following Data structures in Java: 🡪 Stacks
B) Write a Map Reduce program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with Map Reduce, since it is semi-structured and record-oriented.
A) Implement the following Data structures in Java: 🡪 Queues
B) Use Map Reduce to find the shortest path between two people in a social graph. Hint: Use an adjacency list to model a graph, and for each node store the distance from the original node, as well as a back pointer to the original node. Use the mappers to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.
A) Implement the following Data structures in Java: 🡪 Set
B) Implement Friends-of-friends algorithm in MapReduce.
Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.
A) Implement the following Data structures in Java: 🡪 Map
B) Implement an iterative PageRank graph algorithm in MapReduce. Hint: PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank values for each node, and for re-creating the original graph with the updated PageRank values.
A) Implement the following Data structures in Java: 🡪 Linked Lists
B) Perform an efficient semi-join in MapReduce. Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed Cache, and then filter results from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which data source records should be emitted to the reducers.
A) Implement the following Data structures in Java: 🡪 Stacks
B) Write Pig Latin scripts to sort, group, join, project, and filter data.
A) Implement the following Data structures in Java: 🡪 Queues
B) Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
A) Implement the following Data structures in Java: 🡪 Set
B) Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
A) Implement the following Data structures in Java: 🡪 Map
B) Write a Map Reduce program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with Map Reduce, since it is semi-structured and record-oriented.