Welcome to Foundation of Data Science Laboratory
The following Python example handles outliers with the Z-score method: it builds a DataFrame that contains outliers, flags them by Z-score, removes the flagged rows, and prints the original and cleaned DataFrames.
import pandas as pd
import numpy as np
from scipy import stats
# Step 1: Create a DataFrame with some outliers
data = {
    'Value1': [10, 12, 14, 15, 16, 17, 18, 19, 20, 150],  # 150 is an outlier
    'Value2': [20, 22, 21, 23, 24, 25, 24, 23, 22, 100],  # 100 is an outlier
}
df = pd.DataFrame(data)
# Step 2: Identify outliers using the Z-score method
# Calculate Z-scores for each column
z_scores = np.abs(stats.zscore(df))
# Set a threshold for identifying outliers
threshold = 2 # Z-score threshold for outliers
outliers = (z_scores > threshold).any(axis=1)
# Step 3: Remove the outliers
cleaned_df = df[~outliers]
# Step 4: Display the cleaned DataFrame
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
print(cleaned_df)
Original DataFrame:
   Value1  Value2
0      10      20
1      12      22
2      14      21
3      15      23
4      16      24
5      17      25
6      18      24
7      19      23
8      20      22
9     150     100

Cleaned DataFrame:
   Value1  Value2
0      10      20
1      12      22
2      14      21
3      15      23
4      16      24
5      17      25
6      18      24
7      19      23
8      20      22
Creating the DataFrame:
A DataFrame df is created with two columns, Value1 and Value2. The last row holds the outliers: 150 in Value1 and 100 in Value2.
Calculating Z-scores:
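The Z-score of each value is calculated column by column with scipy.stats.zscore, which subtracts the column mean and divides by the column standard deviation (by default the population standard deviation, ddof=0).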
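As a rough check on what the standardization does, the same Z-scores can be computed by hand with pandas. This is a minimal sketch, not part of the original lab code; the name manual_z is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Value1": [10, 12, 14, 15, 16, 17, 18, 19, 20, 150],
    "Value2": [20, 22, 21, 23, 24, 25, 24, 23, 22, 100],
})

# Z = (x - column mean) / column standard deviation.
# ddof=0 selects the population standard deviation, which matches the
# default of scipy.stats.zscore (pandas itself defaults to ddof=1).
manual_z = (df - df.mean()) / df.std(ddof=0)

# manual_z is an illustrative name, not one used in the lab code.
print(manual_z.round(2))
```

Passing ddof=0 explicitly matters here: with the pandas default of ddof=1 the scores would come out slightly smaller than the ones scipy reports.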
Identifying Outliers:
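np.abs(stats.zscore(df)) gives the absolute Z-scores, and a threshold of 2 is applied to them: any value whose absolute Z-score exceeds 2 is treated as an outlier. In this data, 150 and 100 both have an absolute Z-score of about 2.99, so the commonly used threshold of 3 would not flag them.
The condition (z_scores > threshold).any(axis=1) marks a row as an outlier if the Z-score in any of its columns exceeds the threshold.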
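The choice of threshold depends on sample size. With the population standard deviation (ddof=0), no absolute Z-score in a sample of n values can exceed sqrt(n - 1), which is 3 for the ten values used here. The sketch below, which uses NumPy only and variable names chosen for illustration, compares thresholds of 2 and 3 on the Value1 column.

```python
import numpy as np

values = np.array([10, 12, 14, 15, 16, 17, 18, 19, 20, 150], dtype=float)

# Population Z-scores: np.std defaults to ddof=0, the same convention
# that scipy.stats.zscore uses by default.
z = np.abs((values - values.mean()) / values.std())

# Positions of the values flagged at each threshold.
flagged_at_2 = np.flatnonzero(z > 2).tolist()
flagged_at_3 = np.flatnonzero(z > 3).tolist()

# Upper bound on |Z| for n values when ddof=0.
max_possible = np.sqrt(len(values) - 1)

print("Largest |Z|:", round(float(z.max()), 2))
print("Flagged at threshold 2:", flagged_at_2)
print("Flagged at threshold 3:", flagged_at_3)
print("Upper bound sqrt(n - 1):", max_possible)
```

On a sample this small, a threshold of 3 cannot flag anything, which is one reason to use a lower threshold such as 2 for a ten-row example.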
Removing Outliers:
df[~outliers] uses boolean indexing to keep only the rows that were not flagged as outliers. The original index labels are preserved, so the cleaned DataFrame keeps labels 0 to 8.
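The identify-and-remove steps can also be wrapped in a reusable function. The helper below is a hypothetical sketch (remove_outliers_zscore is not a pandas or scipy function); it computes the Z-scores with pandas, assuming ddof=0, and applies the same any-column rule as the code above.

```python
import numpy as np
import pandas as pd


def remove_outliers_zscore(frame, threshold=2.0):
    """Drop rows where any column has |Z| above threshold (hypothetical helper)."""
    # Population Z-scores per column (ddof=0).
    z = np.abs((frame - frame.mean()) / frame.std(ddof=0))
    # A row is an outlier if any of its columns exceeds the threshold.
    is_outlier = (z > threshold).any(axis=1)
    return frame[~is_outlier]


df = pd.DataFrame({
    "Value1": [10, 12, 14, 15, 16, 17, 18, 19, 20, 150],
    "Value2": [20, 22, 21, 23, 24, 25, 24, 23, 22, 100],
})

cleaned = remove_outliers_zscore(df, threshold=2.0)
print(cleaned)
```

If a fresh 0-based index is wanted after filtering, cleaned.reset_index(drop=True) renumbers the remaining rows.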
Displaying Data:
The original and cleaned DataFrames are printed to show the effect of outlier removal.
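Beyond printing both DataFrames, the effect of the removal can be quantified by comparing column means before and after. In this sketch the flagged row (index 9) is dropped directly to keep the example short, and the summary table and its column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Value1": [10, 12, 14, 15, 16, 17, 18, 19, 20, 150],
    "Value2": [20, 22, 21, 23, 24, 25, 24, 23, 22, 100],
})

# Row 9 is the row removed by the Z-score filter in the example output.
cleaned_df = df.drop(index=9)

# Side-by-side column means before and after outlier removal.
summary = pd.DataFrame({
    "mean_before": df.mean(),
    "mean_after": cleaned_df.mean(),
})
print(summary.round(2))
```

The shift in the means shows how strongly a single extreme row can pull a summary statistic in a small sample.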