Missing data, or missing values:- occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.
import seaborn as sns
# df is assumed to contain 'age' and 'embarked' columns (e.g. df = sns.load_dataset('titanic'))
sns.histplot(df['age'], kde=True)   # distplot is deprecated; histplot shows the age distribution
df['age_mean'] = df['age'].fillna(df['age'].mean())       # mean imputation
df[['age_mean', 'age']]
df['age_median'] = df['age'].fillna(df['age'].median())   # median imputation (robust to outliers)
mode = df['embarked'].mode()[0]                           # most frequent category
df['embarked_mode'] = df['embarked'].fillna(mode)         # mode imputation for categorical data
Imbalanced dataset :- one in which the distribution of classes is unequal, meaning some classes are underrepresented compared to others.
For example:
50%-50% is balanced,
60%-40% is slightly imbalanced,
70%-30% (or beyond) is imbalanced.
Techniques to handle imbalanced datasets:
1. Up-Sampling (Over-Sampling): Increases the number of instances in the minority class by duplicating or generating synthetic samples, aiming to balance class distribution. For example, if the minority class has fewer instances, we can duplicate or generate new samples to match the majority class.
from sklearn.utils import resample
import pandas as pd

# df_majority and df_minority are assumed to be the two class subsets,
# e.g. df_majority = df[df['target'] == 0] and df_minority = df[df['target'] == 1]
df_minority_upsampled = resample(df_minority,
                                 replace=True,                 # sample with replacement
                                 n_samples=len(df_majority),   # match the majority class size
                                 random_state=42)
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
2. Down-Sampling (Under-Sampling): Reduces the number of instances in the majority class by randomly removing samples, which balances the classes by decreasing the majority class size. For instance, if the majority class has many instances, we can randomly drop some to match the minority class size.
from sklearn.utils import resample

# df1_majority and df1_minority are assumed to be the two class subsets of another DataFrame
df1_majority_downsampled = resample(df1_majority,
                                    replace=False,                # sample without replacement
                                    n_samples=len(df1_minority),  # match the minority class size
                                    random_state=42)
df1_downsampled = pd.concat([df1_minority, df1_majority_downsampled])
3. SMOTE (Synthetic Minority Over-sampling Technique): A popular method that generates synthetic samples for the minority class by interpolating between existing minority samples, instead of simply duplicating them, making the dataset more balanced and less prone to overfitting compared to naive up-sampling.
from imblearn.over_sampling import SMOTE
import pandas as pd

# final_df is assumed to hold two numeric features and a binary 'target' column
oversample = SMOTE()
X_over, y_over = oversample.fit_resample(final_df[['feature1', 'feature2']], final_df['target'])
df3 = pd.DataFrame(X_over, columns=['feature1', 'feature2'])
df4 = pd.DataFrame(y_over, columns=['target'])
oversample_df = pd.concat([df3, df4], axis=1)
oversample_df
Original Data
Minority Class (Red): ○1 ○2 ○3
Majority Class (Blue): ●4 ●5 ●6 ●7 ●8
UpSampling
Minority Class (Red): ○1 ○2 ○2 ○3 ○3
Majority Class (Blue): ●4 ●5 ●6 ●7 ●8
SMOTE
Minority Class (Red): ○1 ○1.5 ○2 ○2.5 ○3 (the synthetic points ○1.5 and ○2.5 are interpolated between the neighboring minority samples 1, 2 and 2, 3)
Majority Class (Blue): ●4 ●5 ●6 ●7 ●8
In this data:
- Red circles (○) represent the minority class.
- Blue circles (●) represent the majority class.
- UpSampling repeats existing minority class data points.
- SMOTE generates new synthetic data points between existing minority class data points.
Examples with different class distributions illustrating how up-sampling, down-sampling, and SMOTE can help balance datasets:
1. 50%-50% (Balanced)
- Example: Suppose we have a dataset of 1,000 samples where 500 are labeled as Class A and 500 as Class B.
- Balance Status: Already balanced; no need for up-sampling, down-sampling, or SMOTE.
2. 60%-40% (Slightly Imbalanced)
- Example: In a dataset with 1,000 samples, 600 belong to Class A and 400 to Class B.
- Techniques:
- Up-Sampling: Duplicate or generate synthetic samples for Class B to increase it to 600, matching Class A.
- Down-Sampling: Randomly remove samples from Class A to reduce it to 400, matching Class B.
- SMOTE: Generate synthetic samples for Class B (400 → 600) using interpolation between existing samples, creating a balanced 600-600 distribution.
Data interpolation is the process of estimating unknown values within a dataset based on the known values. In Python, there are various libraries available that can be used for data interpolation, such as NumPy, SciPy, and Pandas. Here is an example of how to perform data interpolation using the NumPy library:
1. Linear Interpolation
Linear interpolation is a method of estimating an unknown value between two known values on a straight line.
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
x_new = np.linspace(x[0], x[-1], 10)   # 10 evenly spaced points between 1 and 5
y_interp = np.interp(x_new, x, y)      # linearly interpolate y at the new x values
2. Cubic Interpolation (SciPy)
The `kind` parameter of scipy.interpolate.interp1d specifies the kind of interpolation, either as a string or as an integer giving the order of the spline interpolator. The string must be one of 'linear', 'nearest', 'nearest-up', 'zero', 'slinear', 'quadratic', 'cubic', 'previous', or 'next'. 'zero', 'slinear', 'quadratic' and 'cubic' refer to spline interpolation of zeroth, first, second or third order; 'previous' and 'next' simply return the previous or next value of the point; 'nearest-up' and 'nearest' differ when interpolating half-integers (e.g. 0.5, 1.5) in that 'nearest-up' rounds up and 'nearest' rounds down. Default is 'linear'.
import numpy as np
from scipy.interpolate import interp1d
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 8, 27, 64, 125])
f = interp1d(x, y, kind='cubic')       # build a cubic spline interpolator
x_new = np.linspace(x[0], x[-1], 10)
y_interp = f(x_new)                    # evaluate the spline at the new points
3. Polynomial Interpolation
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])
# Interpolate the data using polynomial interpolation
p = np.polyfit(x, y, 2)                # fit a 2nd-degree polynomial to the data
x_new = np.linspace(x[0], x[-1], 10)
y_interp = np.polyval(p, x_new)        # evaluate the fitted polynomial at the new points
Outliers :- values that are significantly different from the majority of data points in a dataset.
Minimum Value: The smallest number in the dataset.
Q1 (First Quartile): The value at the 25th percentile. This is the median of the lower half of the data.
Q1=np.percentile(Data,[25])
Median (Q2): The middle value of the dataset when ordered. If there is an even number of observations, it is the average of the two middle numbers.
Q2=np.percentile(Data,[50])
Q3 (Third Quartile): The value at the 75th percentile. This is the median of the upper half of the data.
Q3=np.percentile(Data,[75])
Maximum Value: The largest number in the dataset.
Max=np.percentile(Data,[100])
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

list_marks = [42, 32, 56, 75, 89, 54, 32, 89, 90, 87, 67, 54, 45, 98, 99, 67, 74, 1000, 1100]
minimum, Q1, Q2, Q3, maximum = np.quantile(list_marks, [0.0, 0.25, 0.50, 0.75, 1.0])
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

outliers = []
non_outliers = []
for i in list_marks:
    if i < lower_fence or i > upper_fence:
        outliers.append(i)
    else:
        non_outliers.append(i)

sns.boxplot(x=list_marks)      # box plot including the outliers
plt.show()
sns.boxplot(x=non_outliers)    # box plot after removing the outliers
plt.show()
Feature Extraction
Feature Extraction is the process of selecting and extracting the most important features from raw data.
ML application (e.g. 1000 raw features) → select the most important features → machine learning algorithm.
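A minimal sketch of picking the most important features with scikit-learn's SelectKBest; the iris data and k=2 here are purely illustrative, not from the notes above:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                  # 4 input features (illustrative dataset)
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)
selector.get_support()                             # boolean mask showing which features were kept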
Feature Scaling
Feature scaling is the process of adjusting the values of different features to a common scale so that no single feature dominates the learning process. The example below uses Min-Max scaling, which rescales each feature to the [0, 1] range.
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import seaborn as sns

# df is assumed to have numeric 'distance', 'fare' and 'tip' columns (e.g. df = sns.load_dataset('taxis'))
min_max = MinMaxScaler()
k = min_max.fit_transform(df[['distance', 'fare', 'tip']])   # each column rescaled to [0, 1]
df1 = pd.DataFrame(k, columns=['distance', 'fare', 'tip'])
sns.histplot(df1['distance'])
Standardization (Z-score Normalization):
Scales features to have a mean of 0 and a standard deviation of 1, using z = (x - mean) / standard deviation.
The range of z-scores generally depends on the data set, but in standard normal distributions:
Most z-scores typically fall between (-3) and (+3), encompassing about 99.7% of the data (three standard deviations from the mean).
In theory:
Z-scores can extend indefinitely in both positive and negative directions since they represent how many standard deviations a data point is from the mean.
So, the possible range of z-scores is from negative infinity to positive infinity.
from sklearn.preprocessing import StandardScaler
import pandas as pd
import seaborn as sns

# df is assumed to have numeric 'total_bill' and 'tip' columns (e.g. df = sns.load_dataset('tips'))
scaler = StandardScaler()
scaler.fit(df[['total_bill', 'tip']])              # learn the mean and standard deviation
t = scaler.transform(df[['total_bill', 'tip']])    # apply (x - mean) / std
df1 = pd.DataFrame(t, columns=['total_bill', 'tip'])
sns.histplot(df1['total_bill'])
Unit Vector Scaling (L2 Normalization) :- The unit vector approach in feature scaling, also known as vector normalization or L2 normalization, is a technique where each feature vector is scaled to have a magnitude of 1. This process is common in machine learning, especially when working with algorithms that rely on distance calculations, like k-nearest neighbors or clustering.
from sklearn.preprocessing import normalize
import pandas as pd
import seaborn as sns

# df is assumed to hold the iris measurements (e.g. df = sns.load_dataset('iris'))
N = normalize(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])   # each row scaled to unit L2 norm
df1 = pd.DataFrame(N, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
sns.histplot(df1['sepal_length'])
Principal component analysis (PCA) :- reduces the number of dimensions in large datasets to principal components that retain most of the original information.
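A minimal PCA sketch with scikit-learn; the wine dataset, the prior standardization step, and the choice of 2 components are illustrative assumptions:
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)              # 13 numeric features (illustrative dataset)
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=2)                      # keep the top 2 principal components
X_pca = pca.fit_transform(X_scaled)
pca.explained_variance_ratio_                  # share of the original variance each component keeps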
Data Encoding :-is the process of converting categorical data into a numerical format so that machine learning algorithms can process it.
Main types:-
1. Nominal / OHE (One Hot Encoding)
Description: Used for nominal data (categories with no inherent order), such as "red," "blue," "green."
How it works: Each unique category is represented as a binary vector, where only one bit is "1" (indicating the presence of that category), and all others are "0".
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']])        # returns a sparse matrix
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
pd.concat([df, encoded_df], axis=1)
OR
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})
k = pd.get_dummies(df['color'], prefix='color').astype(int)
k
2. Ordinal :-
Description: Used for ordinal data (categories with an order or ranking), like "low," "medium," "high," or to represent categories with arbitrary numerical labels.
Ordinal Encoding: Assigns each category a unique integer based on its rank. For example, "low" = 1, "medium" = 2, "high" = 3.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df1 = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small', 'large']})
encoder1 = OrdinalEncoder(categories=[['small', 'medium', 'large']])   # explicit rank order
encoded1 = encoder1.fit_transform(df1[['size']])
df1['encoded_size'] = encoded1
df1
Label Encoding: Each unique category is assigned a numerical label with no order assumed (often used for non-ordinal data when a simple transformation is acceptable). For example, "cat" = 1, "dog" = 2, "mouse" = 3.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})
encoder = LabelEncoder()                  # intended for target labels; assigns integers alphabetically
encoded = encoder.fit_transform(df['color'])
df['encoded_color'] = encoded
3. Target Guided Ordinal Encoding :-
Description: An encoding technique that assigns values to categories based on their relationship with the target variable in a supervised learning context.
How it works: Categories are ordered based on a statistical property with respect to the target (e.g., mean target value or frequency). For example, if you’re predicting house prices, neighborhoods might be ordered by average price.
import pandas as pd
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})
mean_price = df.groupby('city')['price'].mean().to_dict()   # mean target value per category
df['city_encoded'] = df['city'].map(mean_price)
df
Covariance :- a measure of how two random variables vary together, indicating the direction of their joint linear relationship.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('healthexp')
numerical_df = df.select_dtypes(include=['number'])
covariance_matrix = numerical_df.cov()
covariance_matrix
Variance :- a statistical measurement of the spread between numbers in a data set.
numerical_df = df.select_dtypes(include=['number'])
numerical_df.var()
Pearson correlation coefficient :- measures the strength of the linear relationship between two variables. It has a value between -1 and +1: -1 means a total negative linear correlation, 0 means no correlation, and +1 means a total positive linear correlation.
numerical_df = df.select_dtypes(include=['number'])
numerical_df.corr(method='pearson')
Spearman's rank correlation coefficient :- measures the strength and direction of association between two ranked variables. It basically gives the measure of monotonicity of the relation between two variables i.e. how well the relationship between two variables could be represented using a monotonic function.
numerical_df = df.select_dtypes(include=['number'])
numerical_df.corr(method='spearman')
Pearson Correlation: Measures the linear relationship between two variables.
Spearman Rank Correlation: Measures the monotonic relationship between two ranked variables.
Variance: Indicates how much individual data points deviate from the mean.
Correlation: Measures the strength and direction of the relationship between two variables.
Exploratory Data Analysis (EDA): Analyzing data sets to summarize their main characteristics, often using visual methods.
Supervised: All data is labeled, and the algorithms learn to predict the output from the input data.
Classification: A classification problem is when the output variable is a category, such as red or blue, or disease and no disease.
Regression: A regression problem is when the output variable is a real value, such as dollars or weight.
Unsupervised: All data is unlabeled, and the algorithms learn to infer inherent structure from the input data.
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior (see the sketch after this list).
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people who buy A also tend to buy B.
Semi-supervised: Some data is labeled, but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.
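A minimal clustering sketch with scikit-learn's KMeans, using made-up two-dimensional data; all names and numbers here are illustrative:
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),    # one blob of points around (0, 0)
               rng.normal(5, 1, size=(50, 2))])   # another blob around (5, 5)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                    # cluster label (0 or 1) for each point
kmeans.cluster_centers_                           # the two learned group centers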
Simple Linear Regression :- a statistical method used to model the relationship between a dependent variable (the target or outcome) and a single independent variable (feature). The goal is to find the best-fitting straight line that minimizes the difference between the predicted values and the actual data points.
from sklearn.linear_model import LinearRegression
# X_train, y_train, X_test, y_test are assumed to come from an earlier train_test_split
regressor = LinearRegression()
regressor.fit(X_train, y_train)            # learn the slope(s) and intercept
regressor.coef_                            # slope m
regressor.intercept_                       # intercept C
Y_pred_test = regressor.predict(X_test)    # predictions on the unseen test set
y_test
Cost function :- in linear regression, typically the Mean Squared Error (MSE), measures the average squared difference between predicted and actual values; it is the quantity the model is optimized to minimize.
Gradient descent:- in linear regression is an optimization algorithm used to minimize the cost function by iteratively adjusting model parameters in the direction of the steepest descent.
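A minimal gradient-descent sketch for simple linear regression, repeatedly stepping the slope m and intercept c against the gradient of the MSE cost; the function name, learning rate, and toy data are illustrative assumptions:
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=10000):
    m, c = 0.0, 0.0                                # start slope and intercept at zero
    n = len(X)
    for _ in range(epochs):
        y_pred = m * X + c
        dm = (-2 / n) * np.sum(X * (y - y_pred))   # d(MSE)/dm
        dc = (-2 / n) * np.sum(y - y_pred)         # d(MSE)/dc
        m -= lr * dm                               # step opposite to the gradient
        c -= lr * dc
    return m, c

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)        # roughly y = 2x + 1
m, c = gradient_descent(X, y)                      # converges to m ≈ 2, c ≈ 1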
Multiple Linear Regression :-models the relationship between a dependent variable and multiple independent variables by fitting a linear equation to the observed data.
Polynomial Linear Regression :-models the relationship between the dependent and independent variables by fitting a polynomial equation, allowing for non-linear trends.
Multiple polynomial linear regression:- extends multiple linear regression by introducing polynomial terms of the independent variables, allowing for the modeling of non-linear relationships between the dependent and independent variables.
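A hedged sketch of polynomial regression with scikit-learn, building polynomial terms and then fitting ordinary linear regression on them; the toy data, the degree, and the pipeline helper are illustrative assumptions:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2, 5, 10, 17, 26], dtype=float)    # roughly y = x**2 + 1, a non-linear trend
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)                             # fits coefficients for 1, x and x**2
poly_model.predict(np.array([[6.0]]))            # ≈ 37, i.e. 6**2 + 1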
Performance metrics :- measures used to evaluate the effectiveness of a model by comparing predicted values with actual outcomes; classification uses metrics like accuracy, precision, recall, and F1-score, while regression uses MSE, MAE, RMSE, and R-squared.
R-squared shows how well the model fits the data, while adjusted R-squared accounts for the number of features in the model, preventing it from increasing just by adding more features.
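The regression code below reports MSE, MAE, RMSE, and R-squared; for the classification metrics named above, a small illustrative check with made-up labels might look like:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted labels (made up)
accuracy_score(y_true, y_pred)      # fraction of predictions that are correct
precision_score(y_true, y_pred)     # of the predicted positives, how many were actually positive
recall_score(y_true, y_pred)        # of the actual positives, how many were found
f1_score(y_true, y_pred)            # harmonic mean of precision and recall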
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on the training data only
X_test = scaler.transform(X_test)         # reuse the training mean and std

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
regressor.coef_
regressor.intercept_
Y_pred_test = regressor.predict(X_test)
y_test

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y_test, Y_pred_test)
mae = mean_absolute_error(y_test, Y_pred_test)
rmse = np.sqrt(mse)
score = r2_score(y_test, Y_pred_test)
adjusted_r_squared = 1 - (1 - score) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)
fit(): Computes the mean and standard deviation for scaling based on the data. It stores these values for later transformations but does not alter the data itself.
transform(): Uses the mean and standard deviation calculated with fit() to scale the data. This step adjusts the data to a standardized scale.
fit_transform(): Combines fit() and transform() in one step. This is only applied to the training data in machine learning pipelines to avoid data leakage.
Training Data (X_train):
X_train = scaler.fit_transform(X_train)
Here, fit_transform() is applied to X_train: it first computes the mean and standard deviation from X_train (fit()) and then scales X_train using those values (transform()).
Testing Data (X_test):
X_test = scaler.transform(X_test)
For X_test, only transform() is applied. It uses the mean and standard deviation previously computed on X_train (without recomputing them), so training and testing data are scaled on the same basis.
Preventing Data Leakage: When training a model, you want the scaling parameters (mean and standard deviation) to be based only on the training data. Using fit_transform() on both X_train and X_test would lead to data leakage, as X_test should be kept separate to provide an unbiased evaluation of the model.
For training data only: Use .fit_transform() to calculate and apply scaling in one step.
For testing/validation data: Use .transform() only, to apply the training data’s scaling parameters.
X and Y:
X: The independent features or predictor variables. These are the inputs you use to predict the outcome.
Y: The target variable (dependent variable). This is what the model is trying to predict.
train_test_split() with test_size=0.2:
When test_size=0.2, 20% of X and Y is allocated to the test set, and 80% to the train set.
X_train and y_train are subsets of X and Y, representing the training data (80% of the data).
X_test and y_test are the testing data (20% of the data), used to evaluate the model.
Training the Model:
For a simple linear regression model, we can describe the relationship as: Predicted Y_train = m * X_train + C
where: m is the slope or coefficient (derived from regressor.coef_).
C is the intercept (from regressor.intercept_).
The model "learns" the best values for m and C using X_train and y_train.
Testing the Model:
X_test: This contains the unseen 20% of the data from X. It’s used to check how well the model generalizes to new data.
The model generates predictions on this test data with:
Predicted Y_test = m * X_test + C
This results in Y_pred_test, a set of predictions for y_test.
Evaluating the Model:
You then compare Y_pred_test (the model’s predictions) with y_test (the actual values) to see how accurate the model is on the test data.
The Python pickle module is used for serializing and deserializing Python objects. Serialization, or "pickling," is the process of converting a Python object (like a list, dictionary, etc.) into a byte stream, which can then be saved to disk. This serialized byte stream contains all the information necessary to recreate the object in another Python script or session.
With pickling, you can easily save Python objects to a file and then later load them back into a program, restoring their state and structure. This makes it convenient for data persistence and object sharing across different programs.
import pickle
pickle.dump(scaler, open('scaling.pkl', 'wb'))       # save the fitted StandardScaler
pickle.dump(regressor, open('regmodel.pkl', 'wb'))   # save the trained regression model
model_regressor = pickle.load(open('regmodel.pkl', 'rb'))
model_regressor.predict(X_test_scaled)               # X_test_scaled: test features already scaled
standard_scaler = pickle.load(open('scaling.pkl', 'rb'))
model_regressor.predict(standard_scaler.transform(X_test))   # or scale the raw test features on the fly
Ridge regression:- is a type of linear regression that adds a penalty (or "regularization") to the model to prevent overfitting by discouraging large coefficients. It achieves this by adding a term to the loss function, which shrinks the coefficients of less important features toward zero, making the model more stable and better at generalizing to new data.
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
# X_train_scaled and X_test_scaled are assumed to come from an earlier StandardScaler step
ridge=Ridge()
ridge.fit(X_train_scaled,y_train)
y_pred=ridge.predict(X_test_scaled)
mae=mean_absolute_error(y_test,y_pred)
score=r2_score(y_test,y_pred)
print("Mean absolute error: ", mae)
print("R2 Score: ", score)
Lasso regression:- it is a type of linear regression that adds a penalty to reduce some coefficients to exactly zero, effectively selecting only the most important features for the model and improving prediction accuracy.
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
lasso=Lasso()
lasso.fit(X_train_scaled,y_train)
y_pred=lasso.predict(X_test_scaled)
mae=mean_absolute_error(y_test,y_pred)
score=r2_score(y_test,y_pred)
print("Mean absolute error: ", mae)
print("R2 Score: ", score)
Elastic Net regression:- combines Ridge and Lasso regression penalties to improve linear regression by shrinking some coefficients and setting others to zero. This approach helps prevent overfitting, manages multicollinearity, and selects important features in the model.
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
elastic=ElasticNet()
elastic.fit(X_train_scaled,y_train)
y_pred=elastic.predict(X_test_scaled)
mae=mean_absolute_error(y_test,y_pred)
score=r2_score(y_test,y_pred)
print("Mean absolute error: ", mae)
print("R2 Score: ", score)
home.html
<!DOCTYPE html>
<html>
<head>
<title>FWI Prediction</title>
</head>
<body>
<div class="login">
<h1>FWI Prediction</h1>
<!-- Main Input Form for Receiving Query to our ML Model -->
<form action="{{ url_for('predict_datapoint') }}" method="post">
<input type="text" name="Temperature" placeholder="Temperature" required="required" /><br>
<input type="text" name="RH" placeholder="RH" required="required" /><br>
<input type="text" name="Ws" placeholder="Ws" required="required" /><br>
<input type="text" name="Rain" placeholder="Rain" required="required" /><br>
<input type="text" name="FFMC" placeholder="FFMC" required="required" /><br>
<input type="text" name="DMC" placeholder="DMC" required="required" /><br>
<input type="text" name="ISI" placeholder="ISI" required="required" /><br>
<input type="text" name="Classes" placeholder="Classes" required="required" /><br>
<input type="text" name="Region" placeholder="Region" required="required" /><br>
<button type="submit" class="btn btn-primary btn-block btn-large">Predict</button>
</form>
<p>The FWI prediction is: {{ result }}</p>
</div>
</body>
</html>
app.py
import pickle
from flask import Flask, request, jsonify, render_template
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

app = Flask(__name__)

# Load Ridge regressor model and standard scaler from pickle files
ridge_model = pickle.load(open('models/ridge.pkl', 'rb'))
standard_scaler = pickle.load(open('models/scaler.pkl', 'rb'))

# Route for home page
@app.route('/')
def index():
    return render_template('home.html')

# Route for prediction
@app.route('/predictdata', methods=['GET', 'POST'])
def predict_datapoint():
    if request.method == 'POST':
        # Get values from form input
        Temperature = float(request.form.get('Temperature'))
        RH = float(request.form.get('RH'))
        Ws = float(request.form.get('Ws'))
        Rain = float(request.form.get('Rain'))
        FFMC = float(request.form.get('FFMC'))
        DMC = float(request.form.get('DMC'))
        ISI = float(request.form.get('ISI'))
        Classes = float(request.form.get('Classes'))
        Region = float(request.form.get('Region'))

        # Combine inputs into a numpy array and scale them
        new_data = np.array([[Temperature, RH, Ws, Rain, FFMC, DMC, ISI, Classes, Region]])
        new_data_scaled = standard_scaler.transform(new_data)

        # Make prediction
        prediction = ridge_model.predict(new_data_scaled)

        # Render the result on the home page
        return render_template('home.html', result=prediction[0])
    else:
        return render_template('home.html')

if __name__ == "__main__":
    app.run(host="0.0.0.0")