UNIT-I: (Big Java, 4th Edition)
Data structures in Java: Linked List, Stacks, Queues, Sets, Maps; (Chapter 15, 16)
Generics: Generic classes and Type parameters, Implementing Generic Types, Generic Methods,
(Chapter 17)
Wrapper Classes (Chapter 7)
Concept of Serialization (Chapter 19)
UNIT-II: Working with Big Data: Google File System, Hadoop Distributed File System (HDFS), Building blocks of Hadoop (Name node, Data node, Secondary Name node, Job Tracker, Task Tracker), Introducing and Configuring Hadoop cluster (Local, Pseudo-distributed mode, Fully Distributed mode), Configuring XML files.
UNIT-III: Writing Map Reduce Programs: A Weather Dataset, Understanding Hadoop API for Map Reduce Framework (Old and New), Basic programs of Hadoop Map Reduce: Driver code, Mapper code, Reducer code, Record Reader, Combiner, Partitioner
UNIT-IV: Stream Memory and Spark: Introduction to Streams Concepts– Stream Data Model and Architecture, Stream computing, Sampling Data in a Stream, Filtering Streams, Counting Distinct Elements in a Stream, Introduction to Spark Concept, Spark Architecture and components, Spark installation, Spark RDD (Resilient Distributed Dataset) – Spark RDD operations.
UNIT-V: Pig: Hadoop Programming Made Easier: Admiring the Pig Architecture, Going with the Pig Latin Application Flow, Working through the ABCs of Pig Latin, Evaluating Local and Distributed Modes of Running Pig Scripts, Checking Out the Pig Script Interfaces, Scripting with Pig Latin.
Applying Structure to Hadoop Data with Hive: Saying Hello to Hive, Seeing How the Hive is Put Together, Getting Started with Apache Hive, Examining the Hive Clients, Working with Hive Data Types, Creating and Managing Databases and Tables, Seeing How the Hive Data Manipulation Language Works, Querying and Analyzing Data.
Textbooks: -
1. Big Java, 4th Edition, Cay Horstmann, John Wiley & Sons, Inc.
2. Hadoop: The Definitive Guide by Tom White, 3rd Edition, O'Reilly
Reference Books: -
1. Hadoop in Action by Chuck Lam, MANNING Publ.
2. Hadoop for Dummies by Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, Rafael Coss
3. Hadoop in Practice by Alex Holmes, MANNING Publ.
4. Big Data Analytics by Dr. A. Krishna Mohan and Dr. E. Laxmi Lydia
5. Hadoop MapReduce Cookbook by Srinath Perera, Thilina Gunarathne
Software Links: -
1. Hadoop: http://hadoop.apache.org/
2. Hive: https://cwiki.apache.org/confluence/display/Hive/Home
3. Pig Latin: http://pig.apache.org/docs/r0.7.0/tutorial.html
import java.util.LinkedList;
import java.util.Scanner;

public class UniversalLinkedList
{
    public static void main(String[] args)
    {
        // 'Object' is the parent of all classes in Java.
        // Using <Object> allows this list to store Integers, Strings, Doubles, etc.
        LinkedList<Object> anyList = new LinkedList<>();
        Scanner sc = new Scanner(System.in);
        int choice;
        do {
            System.out.println("\n--- Universal LinkedList (Accepts Any Type) ---");
            System.out.println("1. Add Integer\n2. Add Double\n3. Add String\n4. Add Boolean\n5. Display & Size\n6. Exit");
            System.out.print("Enter choice: ");
            choice = sc.nextInt();
            switch (choice) {
                case 1 -> {
                    System.out.print("Enter an integer: ");
                    // sc.nextInt() returns a primitive 'int'; Java boxes it into an 'Integer' object
                    anyList.add(sc.nextInt());
                }
                case 2 -> {
                    System.out.print("Enter a double (e.g. 10.5): ");
                    // sc.nextDouble() boxes into a 'Double' object
                    anyList.add(sc.nextDouble());
                }
                case 3 -> {
                    System.out.print("Enter a string: ");
                    anyList.add(sc.next()); // Strings are already objects
                }
                case 4 -> {
                    System.out.print("Enter a boolean (true/false): ");
                    anyList.add(sc.nextBoolean()); // Boxes into a 'Boolean' object
                }
                case 5 -> {
                    System.out.println("List Contents: " + anyList);
                    System.out.println("Total Items: " + anyList.size());
                }
            }
        } while (choice != 6);
        sc.close();
    }
}
Software Requirements: -
Hadoop: https://hadoop.apache.org/release/2.7.6.html
Java: https://www.oracle.com/java/technologies/javase/javase8u211-later-archive-downloads.html
Eclipse : https://www.eclipse.org/downloads/
Experiment 1: Week 1, 2:
1. Implement the following Data structures in Java a) Linked Lists b) Stacks c) Queues d) Set e) Map
Experiment 2: Week 3:
2. (i) Set up and install Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed. (ii) Use web-based tools to monitor your Hadoop setup.
Experiment 3: Week 4:
3. Implement the following file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files. Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using command-line utilities such as those shown below.
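For reference, a minimal HDFS command sequence for these tasks (directory and file names are illustrative):
hdfs dfs -mkdir -p /user/student/lab3                              # create a directory in HDFS
hdfs dfs -put access.log /user/student/lab3/                       # copy a local file into HDFS
hdfs dfs -ls /user/student/lab3                                    # list the directory contents
hdfs dfs -get /user/student/lab3/access.log ./access_copy.log     # retrieve a file back to the local disk
hdfs dfs -rm /user/student/lab3/access.log                         # delete a file from HDFS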
Experiment 4: Week 5:
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
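Before writing your own Driver, Mapper, and Reducer, you can try the WordCount example bundled with Hadoop; a typical invocation (input/output paths are illustrative) looks like this:
hdfs dfs -mkdir -p /input
hdfs dfs -put sample.txt /input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /wc_output
hdfs dfs -cat /wc_output/part-r-00000                              # view the word counts produced by the reducer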
Experiment 5: Week 6:
5. Write a MapReduce program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Experiment 6: Week 7:
6. Use Map Reduce to find the shortest path between two people in a social graph. Hint: Use an adjacency list to model a graph, and for each node store the distance from the original node, as well as a back pointer to the original node. Use the mappers to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.
Experiment 7: Week 8:
7. Implement the Friends-of-Friends (FoF) algorithm in MapReduce. Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.
Experiment 8: Week 9:
8. Implement an iterative PageRank graph algorithm in MapReduce. Hint: PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank values for each node, and for re-creating the original graph with the updated PageRank values.
Experiment 9: Week 10:
9. Perform an efficient semi-join in MapReduce. Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed Cache and then filter results from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which data source records should be emitted to the reducers.
Experiment 10: Week 11:
10. Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data.
Experiment 11: Week 12:
11. Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
First, list the installed Java packages to know exactly what to remove.
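For example, on a Debian/Ubuntu system:
dpkg --list | grep -i jdk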
Note the package names (e.g., openjdk-11-jdk, oracle-java8-installer).
Run the command below corresponding to your installation type.
Option A: Standard Removal (OpenJDK)
This removes all OpenJDK versions (common for most Ubuntu users).
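A typical removal sequence looks like this (adjust if your listing showed different package names):
sudo apt purge 'openjdk-*' -y
sudo apt autoremove -y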
Option B: Oracle Java Removal
If you installed Oracle Java specifically:
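Assuming it was installed through the oracle-java8-installer package mentioned above:
sudo apt purge oracle-java8-installer -y
sudo apt autoremove -y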
To ensure no "dead" links remain in your system alternatives:
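For example:
sudo update-alternatives --remove-all java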
(If you see an error saying "no alternatives for java", that is good—it means the system is already clean.)
If you previously set JAVA_HOME manually, you must remove it to prevent errors with the new installation.
Open your bash config:
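This is the same file edited again later during installation:
nano ~/.bashrc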
Find and delete any lines that look like:
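For example, entries such as these (the same lines added by an earlier Java setup):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin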
Save and Exit: Press Ctrl+O, Enter, then Ctrl+X.
Refresh your session:
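The command for this (the same one used later in the installation guide) is:
source ~/.bashrc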
Run this command. It should now return "Command 'java' not found".
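The check is the same command used later to verify the new installation:
java -version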
_________________________________________________________________________________________________________________
For Hadoop 2.10, the most suitable version is Java 8 (OpenJDK 8).
While newer versions of Hadoop (3.3+) support Java 11, the Hadoop 2.x series is most stable with Java 8. Java 7 is also supported but is end-of-life and not recommended.
Here is the step-by-step guide to installing and configuring it on Ubuntu.
Open your terminal (Ctrl+Alt+T) and run the following commands:
sudo apt update
sudo apt install openjdk-8-jdk -y
Once the installation is complete, verify the version to ensure it is active.
java -version
You should see output similar to: openjdk version "1.8.0_xxx"
You will need the exact path for your Hadoop configuration. Run this command:
update-alternatives --list java
Copy the path displayed (excluding /bin/java at the end).
Standard Ubuntu path: /usr/lib/jvm/java-8-openjdk-amd64
You need to set JAVA_HOME so Hadoop can find Java.
A. Edit your bash configuration:
nano ~/.bashrc
B. Add these lines to the end of the file:
(Replace the path below if yours was different in Step 3)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
C. Save and Exit:
Press CTRL + O, then Enter (to save).
Press CTRL + X (to exit).
D. Apply the changes:
source ~/.bashrc
Even with the system variable set, Hadoop 2.10 often requires JAVA_HOME to be explicitly defined in its own configuration file.
Navigate to your Hadoop configuration directory (usually $HADOOP_HOME/etc/hadoop/).
Edit hadoop-env.sh:
nano hadoop-env.sh
Find the line export JAVA_HOME=${JAVA_HOME}.
Change it to the specific path:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
To confirm Java is ready for Hadoop, check that the JAVA_HOME variable returns your path:
echo $JAVA_HOME
A comprehensive laboratory guide for installing and configuring Hadoop 2.10 on Ubuntu with JDK 8.
Prerequisites & Setup
Mode 1: Standalone Mode (Default, for debugging)
Mode 2: Pseudo-Distributed Mode (Simulated cluster, standard for learning)
Mode 3: Fully Distributed Mode (Real cluster setup)
Web Monitoring
Prerequisites & Setup
Before starting the modes, ensure the base environment is ready.
1. Verify Java 8
Ensure you have the correct version (as installed in the previous step).
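The same check used in the Java installation above:
java -version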
2. Install SSH (Secure Shell)
Hadoop requires SSH to manage its nodes, even on a single machine.
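On Ubuntu this is typically:
sudo apt install openssh-server openssh-client -y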
3. Configure Password-less SSH
Hadoop must log in to "localhost" without a password.
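A standard way to set this up is to generate a key pair and authorize it for localhost:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost      # should log in without prompting for a password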
4. Download and Extract Hadoop 2.10
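For example, assuming release 2.10.1 from the Apache archive and an install location of /usr/local/hadoop (adjust both to your setup):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
tar -xzf hadoop-2.10.1.tar.gz
sudo mv hadoop-2.10.1 /usr/local/hadoop
sudo chown -R $USER:$USER /usr/local/hadoop      # give your user ownership of the install directory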
5. Configure Environment Variables
Edit your bashrc file:
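The same editor command used earlier:
nano ~/.bashrc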
Add the following to the end:
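Assuming Hadoop was extracted to /usr/local/hadoop as in the previous step:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin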
Save and exit (Ctrl+O, Enter, Ctrl+X), then run: source ~/.bashrc
6. Set JAVA_HOME in Hadoop Config
Edit hadoop-env.sh to hardcode the Java path (Hadoop often struggles to find it otherwise).
Find export JAVA_HOME=${JAVA_HOME} and change it to:
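This is the same path identified in the Java setup above:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64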
Mode 1: Standalone Mode (Default, for debugging)
Concept: This is the default mode. Hadoop runs as a single Java process. It is used for debugging and does not use HDFS; it uses the local Linux file system.
Configuration:
No XML configuration is required (default files are empty).
Laboratory Task: Run the "Grep" example to verify that Hadoop can execute a MapReduce job.
Create input data:
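One possible setup, copying Hadoop's own configuration files as sample input (directory names are illustrative):
mkdir ~/input
cp $HADOOP_HOME/etc/hadoop/*.xml ~/input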
Run the Hadoop MapReduce job:
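Using the examples JAR that ships with Hadoop (the output directory must not already exist):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ~/input ~/grep_output 'dfs[a-z.]+'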
This searches for the string dfs in the input files.
View Output:
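For example:
cat ~/grep_output/*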
(If you see a list of text matches, Standalone mode is working.)
Mode 2: Pseudo-Distributed Mode (Simulated cluster, standard for learning)
Concept: This simulates a full cluster on a single machine. All daemons (NameNode, DataNode, ResourceManager, NodeManager) run as separate Java processes on localhost. This is the standard mode for students.
Configuration Steps:
Navigate to the config folder: cd $HADOOP_HOME/etc/hadoop/
1. core-site.xml (Defines the NameNode address)
Add inside <configuration> tags:
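A minimal property block for pseudo-distributed mode (port 9000 is the conventional choice):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>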
2. hdfs-site.xml (Defines replication factor)
Since we have only one node, replication must be 1.
Add inside <configuration> tags:
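A minimal property block:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>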
3. mapred-site.xml (Defines MapReduce framework)
First, copy the template:
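In Hadoop 2.x the file ships as a template, so from the configuration directory run:
cp mapred-site.xml.template mapred-site.xml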
Add inside <configuration> tags:
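A minimal property block:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>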
4. yarn-site.xml (Defines ResourceManager)
Add inside <configuration> tags:
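A minimal property block:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>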
5. Execution
Format the NameNode (Do this only once!):
Start HDFS (NameNode & DataNode):
Start YARN (ResourceManager & NodeManager):
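The corresponding commands, in order (assuming $HADOOP_HOME/bin and $HADOOP_HOME/sbin are on your PATH as set earlier):
hdfs namenode -format      # one-time formatting of the NameNode storage
start-dfs.sh               # launches NameNode, DataNode, SecondaryNameNode
start-yarn.sh              # launches ResourceManager, NodeManager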
Verify Daemons:
Run jps. You should see 5 key processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager.
Mode 3: Fully Distributed Mode (Real cluster setup)
Concept: This is a real-world cluster involving multiple machines (nodes). One machine is the Master (NameNode), and the others are Slaves (DataNodes).
Laboratory Note: To perform this in a lab, you generally need 2+ Virtual Machines (VMs) networked together (Bridged Adapter). Let's assume you have a Master (IP: 192.168.1.10) and a Slave (IP: 192.168.1.11).
Configuration Changes (Different from Pseudo-mode):
1. Networking (On all nodes)
Edit /etc/hosts on ALL machines so they can ping each other by name.
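For example, using the IP addresses above (the hostnames master and slave1 are placeholders; use your own):
192.168.1.10  master
192.168.1.11  slave1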
2. SSH Copy (Master to Slaves)
The Master must be able to SSH into all Slaves without a password.
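A typical way to do this (hduser and slave1 are placeholders for your own user name and slave hostname):
ssh-copy-id hduser@slave1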
3. core-site.xml (On Master)
Change localhost to the Master's hostname.
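Assuming the Master's hostname is master, as in the /etc/hosts example above:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>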
4. Slaves file (On Master)
Edit $HADOOP_HOME/etc/hadoop/slaves.
Remove localhost and add your slave hostnames:
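For example, with the hostnames used above:
slave1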
5. Execution
Format NameNode on Master only.
Run start-dfs.sh and start-yarn.sh on Master only.
The script will automatically use SSH to start DataNodes on the slave machines.
Web Monitoring
Once you have started Hadoop (in Pseudo or Fully Distributed mode), you can monitor the system using your browser.
1. HDFS NameNode UI
URL: http://localhost:50070
Purpose: View the file system, check how much storage is available, and see if your DataNodes are "Live" or "Dead".
Note for Lab: Go to Utilities -> Browse the file system to see your HDFS files graphically.
2. YARN ResourceManager UI
URL: http://localhost:8088
Purpose: Monitor running jobs (applications). If you run a MapReduce job, it will appear here. You can see its progress (0% to 100%) and logs.
3. JobHistory Server (Optional)
URL: http://localhost:19888
Purpose: View details of finished jobs. (Requires starting the history server with mr-jobhistory-daemon.sh start historyserver.)