Welcome to Foundation of Data Science Laboratory
Create a DataFrame with Some Outliers
Identify Outliers Using the Z-score Method
Remove the Outliers
Display the Cleaned DataFrame
import pandas as pd
import numpy as np
from scipy import stats

# Step 1: Create a DataFrame with some outliers
data = {
    'A': [10, 12, 13, 14, 15, 100, 16, 17, 18, 19],  # 100 is an outlier
    'B': [20, 22, 23, 24, 25, 30, 26, 27, 28, 29],   # No outliers here
}
df = pd.DataFrame(data)

# Step 2: Identify outliers using the Z-score method
# Calculate the absolute Z-score of every value, column by column
z_scores = np.abs(stats.zscore(df))
# Set a threshold for identifying outliers
threshold = 2  # You can adjust this value
# Flag a row if any of its columns exceeds the threshold
outliers = (z_scores > threshold).any(axis=1)

# Step 3: Remove the outliers
cleaned_df = df[~outliers]

# Step 4: Display the cleaned DataFrame
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
print(cleaned_df)
Original DataFrame:
     A   B
0   10  20
1   12  22
2   13  23
3   14  24
4   15  25
5  100  30
6   16  26
7   17  27
8   18  28
9   19  29

Cleaned DataFrame:
    A   B
0  10  20
1  12  22
2  13  23
3  14  24
4  15  25
6  16  26
7  17  27
8  18  28
9  19  29
Creating the DataFrame:
A DataFrame df is created with two columns of ten values each: column A contains one outlier (100), and column B contains no outliers.
Calculating Z-scores:
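The Z-score of each value is calculated column by column using scipy.stats.zscore: the column mean is subtracted from the value and the result is divided by the column's standard deviation (by default the population standard deviation, ddof=0).
np.abs(stats.zscore(df)) gives the absolute Z-scores, so values far above and far below the mean are treated the same way.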
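As a cross-check, the Z-scores can also be computed by hand. The sketch below assumes the same df as in the lab code; the names manual_z and scipy_z are illustrative and not part of the original. Note that pandas' std() defaults to ddof=1, so ddof=0 is passed explicitly to match scipy.stats.zscore.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'A': [10, 12, 13, 14, 15, 100, 16, 17, 18, 19],
    'B': [20, 22, 23, 24, 25, 30, 26, 27, 28, 29],
})

# Z-score by hand: subtract the column mean, divide by the population std (ddof=0)
manual_z = (df - df.mean()) / df.std(ddof=0)

# np.asarray makes the comparison independent of the type SciPy returns
scipy_z = np.asarray(stats.zscore(df))

print(np.allclose(manual_z.to_numpy(), scipy_z))  # True
print(manual_z.abs().round(2))
```

The value 100 in column A gets an absolute Z-score of about 2.98, while every value in column B stays below 2.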
Identifying Outliers:
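Outliers are identified by checking whether any absolute Z-score in a row exceeds the threshold (2 here). (z_scores > threshold).any(axis=1) returns one True/False value per row, and rows where it is True are marked as outliers. In this data only row 5 is flagged: the value 100 has an absolute Z-score of about 2.98, so it would not be caught by the commonly used threshold of 3.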
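To see how much the result depends on the threshold, the following sketch repeats the check for two values. It assumes the same df as above; the dictionary flagged is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'A': [10, 12, 13, 14, 15, 100, 16, 17, 18, 19],
    'B': [20, 22, 23, 24, 25, 30, 26, 27, 28, 29],
})

z_scores = np.abs(np.asarray(stats.zscore(df)))

# Collect the index labels of the flagged rows for each threshold
flagged = {}
for threshold in (2, 3):
    outliers = (z_scores > threshold).any(axis=1)
    flagged[threshold] = df.index[outliers].tolist()
    print(f"threshold={threshold}: flagged rows {flagged[threshold]}")
# threshold=2: flagged rows [5]
# threshold=3: flagged rows []
```

With only ten values, a single extreme point inflates the standard deviation so much that its own Z-score stays below 3, which is why a lower threshold is used in this exercise.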
Removing Outliers:
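The DataFrame is filtered with df[~outliers]: the ~ operator inverts the boolean mask, so only rows not identified as outliers are kept. The remaining rows keep their original index labels, which is why the cleaned output jumps from 4 to 6.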
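If a continuous index is preferred after filtering, the cleaned DataFrame can be renumbered. This is an optional extra step, not part of the original lab code; the name cleaned_reset is illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'A': [10, 12, 13, 14, 15, 100, 16, 17, 18, 19],
    'B': [20, 22, 23, 24, 25, 30, 26, 27, 28, 29],
})

z_scores = np.abs(np.asarray(stats.zscore(df)))
outliers = (z_scores > 2).any(axis=1)

# Boolean indexing keeps the original labels (0-4, 6-9)
cleaned_df = df[~outliers]

# reset_index(drop=True) renumbers the rows 0..8 and discards the old labels
cleaned_reset = cleaned_df.reset_index(drop=True)

print(cleaned_df.index.tolist())
print(cleaned_reset.index.tolist())
```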
Displaying Data:
The original and cleaned DataFrames are printed one after the other so the removed row (index 5) can be seen by comparison.