EXPLORATORY DATA ANALYSIS
Class: III B.Tech, Semester: V, Sec: A&B, Year: 2025-26, Batch: 2023
Each Question carries 5 Marks
Explain the key differences between Exploratory Data Analysis (EDA), classical analysis, and Bayesian analysis. How does the objective of EDA diverge from the other two methods?
Describe the essential steps involved in a typical EDA process. For each step, provide a brief explanation of its purpose and an example of a technique used.
Differentiate between numerical data and categorical data. Provide an example for each and explain why understanding the data type is crucial before performing an analysis.
What are measurement scales in data analysis? List and briefly explain the four primary scales, providing an example of each and how they impact the type of statistical operations that can be performed.
Discuss the significance of EDA in the broader field of data science. Why is it considered a critical first step, and what risks might a data scientist face if this phase is skipped or performed inadequately?
For all problems, you'll need the following libraries. If you don't have them installed, you can use pip install pandas numpy matplotlib seaborn.
Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Set a style for the plots
plt.style.use('ggplot')
Dataset: You have a small dictionary representing the sales of different fruits in a week.
Python
data = {
'Fruit': ['Apples', 'Bananas', 'Oranges', 'Grapes', 'Strawberries'],
'Sales': [150, 220, 180, 95, 110]
}
df_sales = pd.DataFrame(data)
Tasks:
Create a simple bar chart using matplotlib to visualize the sales of each fruit.
Add a title to the chart: "Weekly Fruit Sales".
Label the x-axis "Fruit" and the y-axis "Sales (units)".
Make the bars a single color, for example, a shade of green.
Based on the chart, identify which fruit had the highest sales.
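A minimal solution sketch for the tasks above, assuming the imports and df_sales defined earlier (the figure size and the particular green shade are arbitrary choices):
Python
# Bar chart of weekly fruit sales (one possible solution)
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(df_sales['Fruit'], df_sales['Sales'], color='seagreen')  # single green shade
ax.set_title('Weekly Fruit Sales')
ax.set_xlabel('Fruit')
ax.set_ylabel('Sales (units)')
plt.show()

# The fruit with the highest sales (Bananas, 220 units in this data)
print(df_sales.loc[df_sales['Sales'].idxmax(), 'Fruit'])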
Dataset: A hypothetical dataset of customer ratings for different movie genres.
Python
data = {
'MovieID': range(1, 26),
'Genre': ['Action', 'Comedy', 'Drama', 'Sci-Fi', 'Action', 'Comedy', 'Drama', 'Sci-Fi', 'Action', 'Comedy', 'Drama', 'Sci-Fi', 'Action', 'Comedy', 'Drama', 'Sci-Fi', 'Action', 'Comedy', 'Drama', 'Sci-Fi', 'Action', 'Comedy', 'Drama', 'Sci-Fi', 'Action'],
'Rating': [4.5, 3.8, 5.0, 4.2, 4.8, 4.0, 4.5, 4.1, 4.7, 3.9, 4.6, 4.3, 4.9, 3.7, 4.8, 4.0, 4.6, 4.1, 4.7, 4.2, 4.8, 3.8, 4.9, 4.4, 4.7]
}
df_movies = pd.DataFrame(data)
Tasks:
Calculate the average rating for each movie genre. You'll need to group the data by 'Genre' and then compute the mean of the 'Rating'.
Use seaborn.barplot to create a bar chart of the average ratings per genre.
Set the title to "Average Movie Ratings by Genre".
Label the axes appropriately: "Genre" and "Average Rating".
Order the bars in descending order of average rating.
Interpretation: Which genre has the highest average rating? How does seaborn.barplot handle the aggregation step?
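A minimal solution sketch for the tasks above. Note that seaborn.barplot aggregates for you, plotting the mean of 'Rating' per 'Genre' by default, so the explicit groupby mainly serves as a cross-check and supplies the bar order:
Python
# Average rating per genre, sorted descending for the bar order
avg_ratings = df_movies.groupby('Genre')['Rating'].mean().sort_values(ascending=False)
print(avg_ratings)

plt.figure(figsize=(8, 5))
ax = sns.barplot(x='Genre', y='Rating', data=df_movies,
                 order=avg_ratings.index,   # descending order of average rating
                 errorbar=None)             # on older seaborn versions use ci=None
ax.set_title('Average Movie Ratings by Genre')
ax.set_xlabel('Genre')
ax.set_ylabel('Average Rating')
plt.show()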
Dataset: A survey of a company's employees showing their satisfaction level by department.
Python
data = {
'Department': ['IT', 'HR', 'Sales', 'Marketing', 'IT', 'HR', 'Sales', 'Marketing', 'IT', 'HR', 'Sales', 'Marketing', 'IT', 'HR', 'Sales', 'Marketing', 'IT', 'HR', 'Sales', 'Marketing'],
'Satisfaction': ['Satisfied', 'Neutral', 'Satisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Satisfied', 'Satisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Neutral', 'Satisfied', 'Satisfied', 'Dissatisfied', 'Satisfied', 'Satisfied', 'Satisfied', 'Satisfied', 'Satisfied']
}
df_survey = pd.DataFrame(data)
Tasks:
First, you need to count the occurrences of each satisfaction level within each department. Use pd.crosstab or groupby().value_counts() to create a pivot table or a similar structure.
Create a stacked bar chart to visualize the distribution of satisfaction levels across the different departments. The x-axis should be 'Department', and the bars should be stacked by 'Satisfaction' level.
Add a clear title like "Employee Satisfaction by Department".
Ensure the legend is visible and correctly labeled.
Analysis: Which department has the highest proportion of 'Dissatisfied' employees? Which department appears to have the most satisfied employees overall?
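A minimal solution sketch for the tasks above, using pd.crosstab for the counts and pandas' own plotting (which wraps matplotlib) for the stacked bars:
Python
# Count satisfaction levels per department
counts = pd.crosstab(df_survey['Department'], df_survey['Satisfaction'])
print(counts)

# Stacked bar chart of the counts
ax = counts.plot(kind='bar', stacked=True, rot=0, figsize=(8, 5))
ax.set_title('Employee Satisfaction by Department')
ax.set_xlabel('Department')
ax.set_ylabel('Number of Employees')
ax.legend(title='Satisfaction')
plt.tight_layout()
plt.show()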
Dataset: Data on the population of the largest cities in a country.
Python
data = {
'City': ['Tokyo', 'Delhi', 'Shanghai', 'Sao Paulo', 'Mexico City', 'Cairo', 'Mumbai', 'Beijing'],
'Population_Millions': [37.4, 30.3, 27.8, 22.0, 21.6, 20.4, 20.2, 20.1]
}
df_cities = pd.DataFrame(data)
Tasks:
Create a horizontal bar chart to display the population of these cities. Horizontal bars are often better for long category names.
Sort the data by population in descending order before plotting. This makes the chart easier to read.
Add the exact population number (in millions) as a text label at the end of each bar. You'll need to loop through the bars (or use ax.text).
Set an appropriate title and axis labels.
Challenge: Create a seaborn plot, but add the labels using matplotlib's ax.text to show how these libraries can be combined.
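A minimal solution sketch for the tasks above, including the challenge of combining a seaborn plot with matplotlib's ax.text (the 0.3 offset and the widened x-limit are arbitrary spacing choices):
Python
# Sort by population (descending) before plotting
df_sorted = df_cities.sort_values('Population_Millions', ascending=False)

plt.figure(figsize=(8, 5))
ax = sns.barplot(x='Population_Millions', y='City', data=df_sorted, color='steelblue')
ax.set_title('Population of Major Cities')
ax.set_xlabel('Population (millions)')
ax.set_ylabel('City')
ax.set_xlim(0, df_sorted['Population_Millions'].max() * 1.15)  # leave room for labels

# Add the exact value at the end of each bar using matplotlib's ax.text
for i, value in enumerate(df_sorted['Population_Millions']):
    ax.text(value + 0.3, i, f'{value}', va='center')
plt.show()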
Dataset: Sales data for two different product categories (Electronics, Clothing) over three different quarters.
Python
data = {
'Quarter': ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3'],
'Category': ['Electronics', 'Electronics', 'Electronics', 'Clothing', 'Clothing', 'Clothing'],
'Sales_Units': [1200, 1500, 1800, 900, 1100, 1300]
}
df_quarterly_sales = pd.DataFrame(data)
Tasks:
Create a grouped bar chart using seaborn.catplot or seaborn.barplot.
The x-axis should represent the 'Quarter'.
The different product categories ('Electronics', 'Clothing') should be represented by separate, adjacent bars within each quarter group. This is the "grouped" part.
Add a title and axis labels.
Add a legend to distinguish between the two product categories.
Analysis: In which quarter was the performance gap between the two categories the largest? In which quarter did both categories perform best?
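A minimal solution sketch for the tasks above; the hue argument is what produces the grouped (side-by-side) bars within each quarter:
Python
plt.figure(figsize=(8, 5))
ax = sns.barplot(x='Quarter', y='Sales_Units', hue='Category', data=df_quarterly_sales)
ax.set_title('Quarterly Sales by Product Category')
ax.set_xlabel('Quarter')
ax.set_ylabel('Sales (units)')
ax.legend(title='Category')
plt.show()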
Pandas Exercises [W3Schools]
Exercise 1: Column Addition (Elementwise)
Problem: Create a DataFrame with two numeric columns and add them elementwise to create a new column.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'Sales_Q1': [100, 150, 120, 180],
'Sales_Q2': [110, 160, 130, 190]
}
df = pd.DataFrame(data)
# Add 'Sales_Q1' and 'Sales_Q2' to get 'Total_Sales'
df['Total_Sales'] = df['Sales_Q1'] + df['Sales_Q2']
# Print the DataFrame
print("Exercise 1: Total Sales Column")
print(df)
Exercise 2: Column Subtraction and Percentage Calculation
Problem: Create a DataFrame with 'Initial_Price' and 'Discount_Amount'. Calculate the 'Final_Price' and then the 'Discount_Percentage'.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'Item': ['A', 'B', 'C', 'D'],
'Initial_Price': [50, 120, 75, 200],
'Discount_Amount': [5, 12, 0, 40]
}
df = pd.DataFrame(data)
# Calculate Final_Price
df['Final_Price'] = df['Initial_Price'] - df['Discount_Amount']
# Calculate Discount_Percentage (handle division by zero if Initial_Price can be 0)
# Using .apply() with a lambda for conditional calculation
df['Discount_Percentage'] = df.apply(
    lambda row: (row['Discount_Amount'] / row['Initial_Price']) * 100 if row['Initial_Price'] > 0 else 0,
    axis=1
)
# Print the DataFrame
print("\nExercise 2: Price and Discount Calculations")
print(df)
Exercise 3: Scalar Multiplication
Problem: Create a DataFrame with a 'Quantity' column and multiply all values in it by a scalar (e.g., 2) to represent 'Double_Quantity'.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'Product': ['Pen', 'Notebook', 'Eraser'],
'Quantity': [10, 25, 50]
}
df = pd.DataFrame(data)
# Multiply 'Quantity' by 2
df['Double_Quantity'] = df['Quantity'] * 2
# Print the DataFrame
print("\nExercise 3: Scalar Multiplication")
print(df)
Exercise 4: Applying a Custom Function to a Column (.apply())
Problem: Create a DataFrame with a 'Score' column. Use the .apply() method to create a new 'Status' column based on the 'Score': 'Pass' if score >= 70, 'Fail' otherwise.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'Student': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 60, 72, 90]
}
df = pd.DataFrame(data)
# Define a function to determine pass/fail status
def get_status(score):
    if score >= 70:
        return 'Pass'
    else:
        return 'Fail'
# Apply the function to the 'Score' column
df['Status'] = df['Score'].apply(get_status)
# Print the DataFrame
print("\nExercise 4: Applying a Custom Function (Pass/Fail)")
print(df)
Exercise 5: Applying a Lambda Function to a Column
Problem: Create a DataFrame with a 'Price' column. Use a lambda function with .apply() to calculate 'Price_with_Tax' (add 10% tax).
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'Item': ['TV', 'Phone', 'Tablet'],
'Price': [500, 800, 300]
}
df = pd.DataFrame(data)
# Calculate 'Price_with_Tax' using a lambda function
df['Price_with_Tax'] = df['Price'].apply(lambda x: x * 1.10)
# Print the DataFrame
print("\nExercise 5: Applying a Lambda Function (Price with Tax)")
print(df)
Exercise 6: Applying a Function Row-wise (.apply(axis=1))
Problem: Create a DataFrame with 'First_Name' and 'Last_Name' columns. Use .apply(axis=1) to combine them into a 'Full_Name' column.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'First_Name': ['John', 'Jane', 'Peter'],
'Last_Name': ['Doe', 'Smith', 'Jones']
}
df = pd.DataFrame(data)
# Define a function to combine names
def combine_names(row):
    return f"{row['First_Name']} {row['Last_Name']}"
# Apply the function row-wise
df['Full_Name'] = df.apply(combine_names, axis=1)
# Print the DataFrame
print("\nExercise 6: Applying Function Row-wise (Full Name)")
print(df)
Exercise 7: Using map() for Value Replacement
Problem: Create a DataFrame with a 'Gender' column (e.g., 'M', 'F'). Use the .map() method to replace these abbreviations with full words ('Male', 'Female').
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Gender': ['F', 'M', 'M', 'F']
}
df = pd.DataFrame(data)
# Define a mapping dictionary
gender_map = {'M': 'Male', 'F': 'Female'}
# Apply the mapping to the 'Gender' column
df['Gender_Full'] = df['Gender'].map(gender_map)
# Print the DataFrame
print("\nExercise 7: Mapping Values (Gender)")
print(df)
Exercise 8: Using replace() for Multiple Value Replacements
Problem: Create a DataFrame with a 'Status' column containing values like 'Active', 'Inactive', 'Pending'. Use .replace() to change 'Active' to 'Online' and 'Inactive' to 'Offline'.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'User': ['U1', 'U2', 'U3', 'U4'],
'Status': ['Active', 'Inactive', 'Pending', 'Active']
}
df = pd.DataFrame(data)
# Define values to replace and their new values
replacements = {'Active': 'Online', 'Inactive': 'Offline'}
# Apply the replacements to the 'Status' column
df['Status_Updated'] = df['Status'].replace(replacements)
# Print the DataFrame
print("\nExercise 8: Replacing Multiple Values (Status)")
print(df)
Exercise 9: Conditional Arithmetic using np.where()
Problem: Create a DataFrame with 'Units_Sold' and 'Price_Per_Unit'. If 'Units_Sold' is greater than 50, apply a 10% discount to the 'Price_Per_Unit' for calculating 'Revenue'. Otherwise, use the original 'Price_Per_Unit'.
Solution:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Product': ['X', 'Y', 'Z', 'W'],
'Units_Sold': [60, 45, 70, 30],
'Price_Per_Unit': [10, 15, 8, 20]
}
df = pd.DataFrame(data)
# Calculate 'Revenue' with a conditional discount using np.where
df['Revenue'] = np.where(
    df['Units_Sold'] > 50,
    df['Units_Sold'] * (df['Price_Per_Unit'] * 0.90),  # 10% discount
    df['Units_Sold'] * df['Price_Per_Unit']            # No discount
)
# Print the DataFrame
print("\nExercise 9: Conditional Arithmetic (Revenue with Discount)")
print(df)
Exercise 10: Using clip() to Cap Values
Problem: Create a DataFrame with a 'Temperature' column. Ensure that all temperatures are within a certain range (e.g., between 10 and 30 degrees Celsius) by clipping values outside this range.
Solution:
import pandas as pd
# Create a sample DataFrame
data = {
'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Bengaluru'],
'Temperature': [35, 28, 40, 32, 22] # in Celsius
}
df = pd.DataFrame(data)
# Define the lower and upper bounds for clipping
lower_bound = 10
upper_bound = 30
# Clip the 'Temperature' column
df['Temperature_Clipped'] = df['Temperature'].clip(lower=lower_bound, upper=upper_bound)
# Print the DataFrame
print("\nExercise 10: Clipping Values (Temperature)")
print(df)
In the context of data science, covariance is a crucial statistical measure that helps us understand the relationship between two variables within a dataset. While the fundamental definition remains the same (how two variables change together), its applications and interpretations in data science are specific and highly valuable.
1. Understanding Relationships between Features:
Feature Engineering: When building predictive models, data scientists often analyze the covariance between different features (variables) in their dataset.
Positive Covariance: If two features (e.g., "hours studied" and "exam score") have a positive covariance, it suggests they tend to increase or decrease together. This might indicate that one feature could be a good predictor of the other, or they might be capturing similar underlying information.
Negative Covariance: If two features (e.g., "price of a product" and "sales volume") have a negative covariance, it suggests an inverse relationship. As one increases, the other tends to decrease. This relationship can be vital for business decisions.
Near-Zero Covariance: A covariance close to zero suggests no strong linear relationship. While it doesn't rule out non-linear relationships, it indicates that a simple linear model might not capture how these variables interact.
2. Covariance Matrix:
For datasets with multiple features, a covariance matrix is a powerful tool. It's a square matrix where:
The diagonal elements represent the variance of each individual feature (covariance of a variable with itself).
The off-diagonal elements represent the covariance between each pair of features.
The covariance matrix provides a comprehensive overview of the linear relationships among all features in a dataset. It's symmetric, meaning Cov(X,Y)=Cov(Y,X).
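A quick illustration with pandas, using a small hypothetical two-feature dataset (the column names and values are invented for the example):
import pandas as pd

# Hypothetical data: hours studied vs. exam score
df = pd.DataFrame({
    'hours_studied': [2, 4, 6, 8, 10],
    'exam_score': [55, 62, 70, 78, 88]
})

print(df.cov())    # covariance matrix: variances on the diagonal, covariances off it
print(df.corr())   # normalized version (correlation), values between -1 and +1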
3. Dimensionality Reduction (e.g., Principal Component Analysis - PCA):
This is one of the most significant applications of covariance in data science. PCA is a technique used to reduce the number of features (dimensions) in a dataset while retaining as much variance (information) as possible.
How it works: PCA heavily relies on the covariance matrix. It identifies the principal components (new, uncorrelated variables) that capture the maximum variance in the data. These principal components are essentially the eigenvectors of the covariance matrix, and their corresponding eigenvalues represent the amount of variance explained by each component.
By projecting the data onto a smaller number of principal components, data scientists can simplify complex datasets, reduce computational costs, and often improve model performance by removing noise and multicollinearity.
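A rough sketch of this idea in NumPy (not a full PCA implementation; in practice you would use scikit-learn's PCA, which also handles centering and related details for you). The data matrix X here is random and purely illustrative:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical data: 100 samples, 3 features

X_centered = X - X.mean(axis=0)          # PCA operates on mean-centered data
cov = np.cov(X_centered, rowvar=False)   # 3x3 covariance matrix

# Eigen-decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]             # largest explained variance first
components = eigenvectors[:, order]

# Project onto the top 2 principal components (dimensionality reduction 3 -> 2)
X_reduced = X_centered @ components[:, :2]
print(X_reduced.shape)   # (100, 2)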
4. Feature Selection: While correlation is often preferred for direct feature selection due to its normalized nature, understanding covariance can still inform decisions. If two features have very high covariance, it might suggest redundancy. Keeping both might lead to multicollinearity issues in some models, making it beneficial to select one or combine them.
5. Outlier Detection: Covariance (and related concepts like Mahalanobis distance, which uses the covariance matrix) can be used in some outlier detection techniques. Outliers can significantly influence covariance calculations, and unusual covariance patterns can sometimes highlight anomalous data points.
6. Basis for Correlation:
It's important to remember that correlation is a normalized version of covariance. While covariance's magnitude is hard to interpret because it's scale-dependent, correlation standardizes this relationship to a value between -1 and +1, making it universally interpretable for the strength and direction of a linear relationship.
In data science, you'll often see data scientists use correlation much more frequently than raw covariance, especially when trying to understand the strength of relationships. However, covariance is the underlying building block.
Limitations in Data Science:
Scale Dependence: As mentioned before, the numerical value of covariance depends on the units of the variables. This makes it difficult to compare covariances across different pairs of variables with different scales.
Only Linear Relationships: Covariance only captures linear relationships. If two variables have a strong non-linear relationship (e.g., parabolic), their covariance could still be zero or close to zero, misleading you into thinking there's no relationship.
Outlier Sensitivity: Covariance is sensitive to outliers. A few extreme data points can heavily skew the covariance value.
Array Creation:
Create a 1D NumPy array arr1 containing the integers from 0 to 9.
Create a 2D NumPy array arr2 (3x3) filled with all True values.
Create a 2D NumPy array arr3 (2x4) with all elements set to 5.
arr1 = np.arange(10)
print("arr1:", arr1)
arr2 = np.full((3, 3), True, dtype=bool)
print("arr2:\n", arr2)
arr3 = np.full((2, 4), 5)
print("arr3:\n", arr3)
Array Properties:
Given arr1 = np.array([1, 2, 3, 4, 5]), print its data type, shape, and number of dimensions.
arr1_prop = np.array([1, 2, 3, 4, 5])
print("Data type of arr1_prop:", arr1_prop.dtype)
print("Shape of arr1_prop:", arr1_prop.shape)
print("Number of dimensions of arr1_prop:", arr1_prop.ndim)
Indexing and Slicing:
Given arr = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90]), extract elements from index 2 to 6 (inclusive of 2, exclusive of 6).
Given arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), extract the element at row 1, column 2.
From the same arr, extract the first column.
arr_slice = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
extracted_slice = arr_slice[2:6]
print("Extracted slice:", extracted_slice)
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
element = arr_2d[1, 2] # Row 1, Column 2 (0-indexed)
print("Element at [1, 2]:", element)
first_column = arr_2d[:, 0]
print("First column:\n", first_column)
Basic Operations:
Create two arrays, a = np.array([1, 2, 3]) and b = np.array([4, 5, 6]). Perform element-wise addition, subtraction, multiplication, and division.
Calculate the sum of all elements in a.
Find the maximum value in b.
print("\n--- Problem 4 ---")
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("a + b:", a + b)
print("a - b:", a - b)
print("a * b:", a * b)
print("a / b:", a / b)
print("Sum of a:", np.sum(a))
print("Max of b:", np.max(b))
Boolean Indexing:
Given arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), extract all odd numbers.
Replace all even numbers in arr with -1.
arr_bool = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
odd_numbers = arr_bool[arr_bool % 2 != 0]
print("Odd numbers:", odd_numbers)
arr_bool[arr_bool % 2 == 0] = -1
print("Array with even numbers replaced:", arr_bool)
Reshaping:
Given arr = np.arange(12), reshape it into a 4x3 2D array.
Reshape the 4x3 array back into a 1D array.
arr_reshape = np.arange(12)
reshaped_arr = arr_reshape.reshape(4, 3)
print("Reshaped (4x3):\n", reshaped_arr)
flattened_arr = reshaped_arr.reshape(-1) # or .flatten()
print("Flattened:", flattened_arr)
Concatenation and Splitting:
Create arr1 = np.array([1, 2, 3]) and arr2 = np.array([4, 5, 6]). Concatenate them horizontally.
Create arr3 = np.array([[1, 2], [3, 4]]) and arr4 = np.array([[5, 6], [7, 8]]). Concatenate them vertically.
Given arr = np.arange(9).reshape(3, 3), split it into three equal-sized sub-arrays horizontally.
arr1_cat = np.array([1, 2, 3])
arr2_cat = np.array([4, 5, 6])
horizontal_concat = np.hstack((arr1_cat, arr2_cat))
print("Horizontal concatenation:", horizontal_concat)
arr3_cat = np.array([[1, 2], [3, 4]])
arr4_cat = np.array([[5, 6], [7, 8]])
vertical_concat = np.vstack((arr3_cat, arr4_cat))
print("Vertical concatenation:\n", vertical_concat)
arr_split = np.arange(9).reshape(3, 3)
split_arrays = np.hsplit(arr_split, 3)
print("Original array for splitting:\n", arr_split)
print("Split arrays:")
for arr in split_arrays:
    print(arr)
Broadcasting:
Given arr = np.array([[1, 2, 3], [4, 5, 6]]), add 10 to every element without using a loop.
Given arr = np.array([[1, 2, 3], [4, 5, 6]]) and col = np.array([10, 20]), add col to each row of arr (think about reshaping col if necessary).
arr_broad = np.array([[1, 2, 3], [4, 5, 6]])
added_ten = arr_broad + 10
print("Array + 10:\n", added_ten)
col = np.array([10, 20])
# Reshape col to be a column vector for broadcasting across rows
added_col = arr_broad + col.reshape(-1, 1)
print("Array + column vector:\n", added_col)
Statistical Operations:
Given arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), calculate the mean of each column.
Calculate the standard deviation of the entire array.
Find the minimum value along axis 0 (columns).
arr_stats = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
column_means = np.mean(arr_stats, axis=0)
print("Mean of each column:", column_means)
std_dev = np.std(arr_stats)
print("Standard deviation of entire array:", std_dev)
min_along_axis0 = np.min(arr_stats, axis=0)
print("Minimum along axis 0 (columns):", min_along_axis0)
Unique Elements and Counts:
Given arr = np.array([1, 2, 1, 3, 2, 4, 5, 4, 1]), find the unique elements.
Count the occurrences of each unique element.
arr_unique = np.array([1, 2, 1, 3, 2, 4, 5, 4, 1])
unique_elements = np.unique(arr_unique)
print("Unique elements:", unique_elements)
unique_elements, counts = np.unique(arr_unique, return_counts=True)
print("Unique elements and their counts:", dict(zip(unique_elements, counts)))
Conditional Operations (where):
Given arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), replace all numbers greater than 5 with 100, and all other numbers with 0.
arr_cond = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
result_cond = np.where(arr_cond > 5, 100, 0)
print("Conditional replacement:", result_cond)
Saving and Loading:
Create an array data = np.arange(100).reshape(10, 10). Save this array to a file named 'my_array.npy'.
Load the 'my_array.npy' file back into a new variable loaded_data. Verify they are identical.
data_save = np.arange(100).reshape(10, 10)
file_name = 'my_array.npy'
np.save(file_name, data_save)
print(f"Array saved to {file_name}")
loaded_data = np.load(file_name)
print("Array loaded from file.")
print("Are original and loaded arrays identical?", np.array_equal(data_save, loaded_data))
Introduction
Creating your own dataset is crucial in many data science and machine learning projects. While numerous publicly available datasets exist, building your own allows you to tailor it to your specific needs and ensure its quality. This article explores the importance of custom datasets and provides a step-by-step guide to creating your own dataset in Python. It also discusses data augmentation and expansion techniques, tools and libraries for dataset creation, best practices for creating high-quality datasets, and ethical considerations in dataset creation.
Understanding the Importance of Custom Datasets: Custom datasets offer several advantages over pre-existing datasets.
They allow you to define the purpose and scope of your dataset according to your specific project requirements. This level of customization ensures that your dataset contains the relevant data needed to address your research questions or solve a particular problem.
Custom datasets provide you with control over the data collection process. You can choose the sources from which you gather data, ensuring its authenticity and relevance. This control also extends to the data cleaning and preprocessing steps, allowing you to tailor them to your needs.
Custom datasets enable you to address any class imbalance issues in pre-existing datasets. By collecting and labeling your own data, you can ensure a balanced distribution of classes, which is crucial for training accurate machine learning models.
Steps to Create Your Own Dataset in Python
Creating your own dataset involves several key steps. Let’s explore each step in detail:
Defining the Purpose and Scope of Your Dataset
Before gathering any data, it is essential to define the purpose and scope of your dataset clearly. Ask yourself what specific problem you are trying to solve or what research questions you are trying to answer. This clarity will guide you in determining the types of data you need to collect and the sources from which you should gather them.
Gathering and Preparing the Data
Once you have defined the purpose and scope of your dataset, you can start gathering the data. Depending on your project, you may collect data from various sources such as APIs, web scraping, or manual data entry. It is crucial to ensure the authenticity and integrity of the data during the collection process.
After gathering the data, you need to prepare it for further processing. This step involves converting the data into a suitable format for analysis, such as CSV or JSON. Additionally, you may need to perform initial data-cleaning tasks, such as removing duplicates or irrelevant data points.
Cleaning and Preprocessing the Data
Data cleaning and preprocessing are essential steps in dataset creation. This process involves handling missing data, dealing with outliers, and transforming the data into a suitable format for analysis. Python provides various libraries, such as Pandas and NumPy, that offer powerful data cleaning and preprocessing tools.
For example, if your dataset contains missing values, you can use the Pandas library to fill in those missing values with appropriate imputation techniques. Similarly, if your dataset contains outliers, you can use statistical methods to detect and handle them effectively.
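For instance, a minimal sketch of mean imputation with pandas (the 'age' column and its values are hypothetical):
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 32, np.nan, 41]})   # hypothetical column with gaps
df['age_filled'] = df['age'].fillna(df['age'].mean())      # replace NaN with the column mean
print(df)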
Organizing and Structuring the Dataset
To ensure the usability and maintainability of your dataset, it is crucial to organize and structure it properly. This step involves creating a clear folder structure, naming conventions, and file formats that facilitate easy access and understanding of the data.
For example, you can organize your dataset into separate folders for different classes or categories. Each file within these folders can represent a single data instance with a standardized naming convention that includes relevant information about the data.
Splitting the Dataset into Training and Testing Sets
Splitting your dataset into training and testing sets is essential to evaluate the performance of machine learning models. The training set is used to train the model, while the testing set assesses its performance on unseen data.
Python’s scikit-learn library provides convenient functions for splitting datasets into training and testing sets. For example, you can use the `train_test_split` function to divide your dataset into the desired proportions randomly.
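A minimal sketch of such a split (the synthetic data is only there to make the example self-contained):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=4, random_state=42)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (80, 4) (20, 4)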
Handling Imbalanced Classes (if applicable)
If your dataset contains imbalanced classes, where some classes have significantly fewer instances than others, it is crucial to address this issue. Imbalanced classes can lead to biased models that perform poorly on underrepresented classes.
There are several techniques to handle imbalanced classes, such as oversampling, undersampling, or using advanced algorithms specifically designed for imbalanced datasets. Python libraries like imbalanced-learn implement these techniques and integrate easily into your dataset creation pipeline.
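For example, a minimal sketch of random oversampling with imbalanced-learn (assuming the imblearn package is installed; the synthetic imbalanced data is purely illustrative):
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Hypothetical imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print("Before:", np.bincount(y))

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)   # minority-class samples are duplicated
print("After:", np.bincount(y_res))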
Techniques for Data Augmentation and Expansion
Data augmentation is a powerful technique used to increase the size and diversity of your dataset. It involves applying various transformations to the existing data, creating new instances that are still representative of the original data.
Image Data Augmentation
Image data augmentation is commonly used to improve model performance in computer vision tasks. Techniques such as rotation, flipping, scaling, and adding noise can be applied to images to create new variations of the original data.
Python libraries like OpenCV and imgaug provide various functions and methods for image data augmentation. For example, you can use the `rotate` function from the OpenCV library to rotate images by a specified angle.
import cv2
image = cv2.imread('image.jpg')
rotated_image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
Text Data Augmentation
Text data augmentation generates new text instances by applying various transformations to the existing text. Techniques such as synonym replacement, word insertion, and word deletion can create diverse variations of the original text.
Python libraries like NLTK and TextBlob provide functions and methods for text data augmentation. For example, you can use the `synsets` function from the NLTK library to find synonyms of words and replace them in the text.
from nltk.corpus import wordnet
def synonym_replacement(text):
    words = text.split()
    augmented_text = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            augmented_text.append(synonyms[0].lemmas()[0].name())
        else:
            augmented_text.append(word)
    return ' '.join(augmented_text)
original_text = "The quick brown fox jumps over the lazy dog."
augmented_text = synonym_replacement(original_text)
Audio Data Augmentation
In audio processing tasks, data augmentation techniques can be applied to audio signals to create new instances. Techniques such as time stretching, pitch shifting, and adding background noise can generate diverse variations of the original audio data.
Python libraries like Librosa and PyDub provide functions and methods for audio data augmentation. For example, you can use the `time_stretch` function from the Librosa library to stretch the duration of an audio signal.
import librosa
audio, sr = librosa.load('audio.wav')
stretched_audio = librosa.effects.time_stretch(audio, rate=1.2)
Video Data Augmentation
Video data augmentation involves applying transformations to video frames to create new instances. Techniques such as cropping, flipping, and adding visual effects can generate diverse variations of the original video data.
Python libraries like OpenCV and MoviePy provide functions and methods for video data augmentation. For example, you can use the `crop` function from the MoviePy library to crop a video frame.
from moviepy.editor import VideoFileClip
video = VideoFileClip('video.mp4')
cropped_video = video.crop(x1=100, y1=100, x2=500, y2=500)
Tools and Libraries for Dataset Creation in Python
Python offers several tools and libraries that can simplify the dataset-creation process. Let’s explore some of these tools and libraries:
Scikit-learn
Scikit-learn is a popular machine-learning library in Python that provides various functions and classes for dataset creation. It offers functions for generating synthetic datasets, splitting datasets into training and testing sets, and handling imbalanced classes.
For example, you can use the `make_classification` function from the `sklearn.datasets` module to generate a synthetic classification dataset.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
Hugging Face Datasets
Hugging Face Datasets is a Python library that provides a wide range of pre-existing datasets for natural language processing tasks. It also offers tools for creating custom datasets by combining and preprocessing existing datasets.
For example, you can use the `load_dataset` function from the `datasets` module to load a pre-existing dataset.
from datasets import load_dataset
dataset = load_dataset('imdb')
Kili Technology
Kili Technology is a data labeling platform that offers tools for creating and managing datasets for machine learning projects. It provides a user-friendly interface for labeling data and supports various data types, including text, images, and audio.
Using Kili Technology, you can easily create labeled datasets by inviting collaborators to annotate the data or by using their built-in annotation tools.
Other Python Libraries for Dataset Creation
Apart from the aforementioned tools and libraries, several other Python libraries can be useful for dataset creation. Some of these libraries include Pandas, NumPy, TensorFlow, and PyTorch. These libraries offer powerful data manipulation, preprocessing, and storage tools, making them essential for dataset creation.
Best Practices for Creating High-Quality Datasets
Creating high-quality datasets is crucial for obtaining accurate and reliable results in data science and machine learning projects. Here are some best practices to consider when creating your own dataset:
Ensuring Data Quality and Integrity
Data quality and integrity are paramount in dataset creation. Ensuring that the data you collect is accurate, complete, and representative of the real-world phenomenon you study is essential. This can be achieved by carefully selecting data sources, validating the data during the collection process, and performing thorough data cleaning and preprocessing.
Handling Missing Data
Missing data is a common issue in datasets and can significantly impact the performance of machine learning models. It is important to handle missing data appropriately by using imputation techniques or using advanced algorithms that can handle missing values.
Dealing with Outliers
Outliers are data points that deviate significantly from the rest of the data. They can disproportionately impact the results of data analysis and machine learning models. It is crucial to detect and handle outliers effectively by using statistical methods or considering the use of robust algorithms that are less sensitive to outliers.
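For example, a minimal sketch of IQR-based outlier detection with pandas (the 'value' column and its contents are hypothetical):
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95, 11, 10]})   # 95 is an obvious outlier

q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['value'] < lower) | (df['value'] > upper)]
print(outliers)   # rows falling outside the IQR fences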
Balancing Class Distribution
If your dataset contains imbalanced classes, it is important to address this issue to prevent biased models. Techniques such as oversampling, undersampling, or using advanced algorithms specifically designed for imbalanced datasets can be used to balance the class distribution.
Documenting and Annotating the Dataset
Proper documentation and annotation of the dataset are essential for its usability and reproducibility. Documenting the data sources, collection methods, preprocessing steps, and any assumptions made during the dataset creation process ensures transparency and allows others to understand and reproduce your work.
Ethical Considerations in Dataset Creation
Dataset creation also involves ethical considerations that should not be overlooked. Here are some key ethical considerations to keep in mind:
Privacy and Anonymization
When collecting and using data, it is important to respect privacy and ensure the anonymity of individuals or entities involved. This can be achieved by removing or encrypting personally identifiable information (PII) from the dataset or obtaining proper consent from individuals.
Bias and Fairness
Bias in datasets can lead to biased models and unfair outcomes. It is crucial to identify and mitigate any biases present in the dataset, such as gender or racial biases. This can be done by carefully selecting data sources, diversifying the data collection process, and using fairness-aware algorithms.
Informed Consent and Data Usage Policies
Obtaining informed consent from individuals whose data is being collected is essential. Individuals should be fully informed about the purpose of data collection, how their data will be used, and any potential risks involved. Additionally, clear data usage policies should be established to ensure responsible and ethical use of the dataset.
Conclusion
Building your own dataset in Python allows you to customize the data according to your project requirements and ensure its quality. By following the steps outlined in this article, you can create a high-quality dataset that addresses your research questions or solves a specific problem.
The article also covered data augmentation and expansion techniques, tools and libraries for dataset creation, best practices for creating high-quality datasets, and the ethical considerations involved. With these insights, you are well-equipped to embark on your own dataset creation journey.
(Courtesy: Deep Sandhya Shukla, Senior Data Analyst at Incedo | LinkedIn)
Class: III B.Tech , Semester: V, CAY: 2025-26
Course Objectives: The main objectives of the course are to
Introduce the fundamentals of Exploratory Data Analysis.
Cover essential exploration techniques for understanding multivariate data by summarizing it through statistical and graphical methods.
Evaluate the models and select the best model.
Course Educational Objectives (CEOs)
To provide foundational knowledge in data science concepts, including the role and importance of Exploratory Data Analysis (EDA) in the data science lifecycle.
To equip students with practical skills in data visualization and exploration techniques using modern tools and libraries.
To develop the ability to preprocess and transform data effectively for analysis and modelling, ensuring data quality and readiness.
To enable students to apply statistical methods for summarizing and interpreting data, fostering analytical thinking.
To prepare students to build and evaluate predictive models, enabling them to solve real-world problems using data-driven approaches.
COURSE OUTCOMES (COs): At the end of the course, the student will be able to
CO1: Understand the Fundamentals of EDA and Data Science (Understand, L2)
CO2: Apply Visualization Techniques for Data Exploration (Apply, L3)
CO3: Perform Data Transformation and Preprocessing (Apply, L3)
CO4: Analyze Data Using Descriptive Statistics (Analyze, L4)
CO5: Develop and Evaluate Predictive Models (Apply, L3)
Textbook:
Reference Books:
1. Ronald K. Pearson, Exploratory Data Analysis Using R, CRC Press, 2020
2. Radhika Datar, Harish Garg, Hands-On Exploratory Data Analysis with R: Become an expert in exploratory data analysis using R packages, 1st Edition, Packt Publishing, 2019
Web References:
1. https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python
2. https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-dataanalysis-eda-using-python/#h-conclusion
3. https://github.com/PacktPublishing/Exploratory-Data-Analysis-with-Python-Cookbook
Exploratory Data Analysis Fundamentals: Understanding data science, the significance of EDA, Steps in EDA, making sense of data, Numerical data, Categorical data, Measurement scales, Comparing EDA with classical and Bayesian analysis, Software tools available for EDA, getting started with EDA.
Visual Aids for EDA: Technical requirements, Line chart, Bar charts, Scatter plot using seaborn, Polar chart, Histogram, Choosing the best chart. Case Study: EDA with Personal Email, Technical requirements, Loading the dataset, Data transformation, Data cleansing, Applying descriptive statistics, Data refactoring, Data analysis.
Data Transformation: Merging database-style data frames, concatenating along with an axis, merging on index, Reshaping and pivoting, Transformation techniques, handling missing data, Mathematical operations with NaN, Filling missing values, Discretization and binning, Outlier detection and filtering, Permutation and random sampling, Benefits of data transformation, Challenges.
Descriptive Statistics: Distribution function, Measures of central tendency, Measures of dispersion, Types of kurtoses, calculating percentiles, Quartiles, Grouping Datasets, Correlation, Understanding univariate, bivariate, multivariate analysis, Time Series Analysis
Unified machine learning workflow, Data pre-processing, Data preparation, Training sets and corpus creation, Model creation and training, Model evaluation, best model selection and evaluation, Model deployment
Case Study: EDA on Wine Quality Data Analysis
Geospatial EDA (Geo-EDA): Introduction, Data Types & Basic Visualizations
Geo-EDA: Spatial Queries, Aggregations & Advanced Mapping
Interactive Dashboards: Introduction to Plotly & Simple Dash Layouts
Interactive Dashboards: Callbacks, Interactivity & Deployment Basics
Advanced Outlier Detection: Statistical & Proximity-Based Methods
Advanced Outlier Detection: Model-Based Methods & Interpretation