Alpana A. Borse

Welcome to Foundation of Data Science Laboratory

2.3 Data Cleaning and Preprocessing

1. Handling Outliers:

Step-by-Step Python Code

Step 1: Create a DataFrame with some outliers
Step 2: Identify outliers using the Z-score method
Step 3: Remove the outliers
Step 4: Display the cleaned DataFrame
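Before using the library call, it may help to see what the Z-score step computes. The sketch below works out the Z-scores for column 'A' by hand with NumPy, as z = |x - mean| / std. It assumes the population standard deviation (ddof=0), which is the default for both np.std and scipy.stats.zscore. The loop over two thresholds is an added illustration, not part of the lab code.

```python
import numpy as np

# Column 'A' from the lab example; 100 is the suspected outlier.
a = np.array([10, 12, 13, 14, 15, 100, 16, 17, 18, 19], dtype=float)

# Z-score: distance from the mean, measured in standard deviations.
# np.std defaults to the population standard deviation (ddof=0),
# the same default used by scipy.stats.zscore.
z = np.abs((a - a.mean()) / a.std())

print(np.round(z, 2))

# Which rows are flagged depends on the threshold chosen.
for threshold in (2, 3):
    flagged = a[z > threshold]
    print(f"threshold={threshold}: flagged values = {flagged}")
```

The value 100 has a Z-score of about 2.98, so it is flagged at a threshold of 2 but not at the commonly quoted threshold of 3. On a sample this small, a single extreme value inflates the standard deviation, which is presumably why the lab code uses the lower threshold.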

import pandas as pd
import numpy as np
from scipy import stats

# Step 1: Create a DataFrame with some outliers
data = {
    'A': [10, 12, 13, 14, 15, 100, 16, 17, 18, 19],  # 100 is an outlier
    'B': [20, 22, 23, 24, 25, 30, 26, 27, 28, 29]    # No outliers here
}
df = pd.DataFrame(data)

# Step 2: Identify outliers using the Z-score method
# Calculate Z-scores for each column
z_scores = np.abs(stats.zscore(df))

# Set a threshold for identifying outliers
threshold = 2  # You can adjust this value
outliers = (z_scores > threshold).any(axis=1)

# Step 3: Remove the outliers
cleaned_df = df[~outliers]

# Step 4: Display the cleaned DataFrame
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
print(cleaned_df)

Expected Output

Original DataFrame:
     A   B
0   10  20
1   12  22
2   13  23
3   14  24
4   15  25
5  100  30
6   16  26
7   17  27
8   18  28
9   19  29

Cleaned DataFrame:

A B