Regression model

Data Analysis and Regression Modeling with Pandas

Objective:

The objective of this lab is to apply data analysis techniques using pandas and to train a linear regression model that predicts the calories burned by players during training sessions.

Prerequisites:

Python 3.x
Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, statsmodels
Dataset containing player parameters in data.csv

Task:

Load the dataset into pandas
Clean the data of errors and missing entries
Convert data types to useful ones
Train a linear regression model
View the linear relationship graphically using a scatter plot
Train an OLS model

The dataset, data.csv, contains health records of players in training. It has the following columns:

Player_ID: Unique identifier for each player.
Duration: Training session duration in minutes.
Average_Pulse: Average heart rate during training.
Max_Pulse: Maximum heart rate during training.
Calorie_Burnage: Calories burned during training.

The first row contains headers, and values are separated by commas.

Step 1: Load the Dataset

import pandas as pd

df = pd.read_csv("data.csv")

print(df.head())
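Before cleaning, it can help to check which types pandas inferred for each column and how many entries are missing. The sketch below does not read the real data.csv; it uses a small in-memory CSV with made-up rows that follow the column layout described above. The names sample_csv and sample_df and all values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample rows that mimic the layout of data.csv (not real data)
sample_csv = io.StringIO(
    "Player_ID,Duration,Average_Pulse,Max_Pulse,Calorie_Burnage\n"
    "1,60,110,130,409.1\n"
    "2,60,abc,145,479.0\n"
    "3,45,109,175,\n"
)

sample_df = pd.read_csv(sample_csv)

# A column holding any non-numeric text is not parsed as a number;
# it shows up with a text/object dtype instead of int64 or float64
column_types = sample_df.dtypes

# Count missing entries per column
missing_counts = sample_df.isna().sum()

print(column_types)
print(missing_counts)
```

A column that should be numeric but is reported with a text dtype is a sign that it contains invalid entries, which Step 2 removes.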

Step 2: Data Cleaning

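Rows with missing or non-numeric values are removed so that they cannot distort or break the model fit.

# Drop rows with missing values

df_cleaned = df.dropna()

# Drop rows where 'Average_Pulse' or 'Calorie_Burnage' cannot be parsed as a number

df_cleaned = df_cleaned[pd.to_numeric(df_cleaned['Average_Pulse'], errors='coerce').notna()]

# .copy() makes df_cleaned an independent DataFrame, so later column assignments are safe

df_cleaned = df_cleaned[pd.to_numeric(df_cleaned['Calorie_Burnage'], errors='coerce').notna()].copy()

# info() prints its report directly and returns None, so it is not wrapped in print()

df_cleaned.info()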
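To see how the filter above works, the following minimal sketch applies the same pd.to_numeric(..., errors='coerce').notna() pattern to a small, made-up Series. The name raw_pulse and its values are hypothetical and are not taken from data.csv.

```python
import pandas as pd

# Hypothetical raw values, including two entries that are not valid numbers
raw_pulse = pd.Series(["110", "abc", "117", "n/a", "103"])

# errors='coerce' replaces anything unparseable with NaN instead of raising
parsed = pd.to_numeric(raw_pulse, errors="coerce")

# notna() yields a boolean mask that is True only for parseable rows
valid_mask = parsed.notna()

# Indexing with the mask keeps the valid rows and drops the rest
kept = raw_pulse[valid_mask]
print(kept.tolist())
```

The kept values are still text at this point, because the mask only selects rows and does not change the column. That is why a separate type conversion follows in Step 3.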

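Step 3: Convert Data Types of Average_Pulse and Calorie_Burnage

# Both columns may still hold numeric text after cleaning, so cast them to floats

df_cleaned['Average_Pulse'] = df_cleaned['Average_Pulse'].astype('float64')

df_cleaned['Calorie_Burnage'] = df_cleaned['Calorie_Burnage'].astype('float64')

print(df_cleaned.dtypes)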

Step 4: Train a Linear Regression Model

We train a linear regression model to predict Calorie_Burnage from Average_Pulse, fitting on 80% of the rows and holding out the remaining 20% as a test set.

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

X = df_cleaned[['Average_Pulse']]

y = df_cleaned['Calorie_Burnage']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

print("Model Coefficients:", model.coef_)

print("Model Intercept:", model.intercept_)
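The split above creates X_test and y_test, which can be used to check how well the fitted line generalizes. The sketch below shows one possible way to do that with r2_score and mean_squared_error from scikit-learn. It runs on synthetic data with an assumed linear trend (slope 3, intercept 50, Gaussian noise) rather than on data.csv, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (assumed linear trend plus noise)
rng = np.random.default_rng(42)
pulse = rng.uniform(80, 160, size=200).reshape(-1, 1)
calories = 3.0 * pulse.ravel() + 50.0 + rng.normal(0, 10, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    pulse, calories, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Predict on rows the model never saw during fitting
y_pred = model.predict(X_test)

# R^2: share of variance explained; RMSE: typical error in target units
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R^2 on test set: {r2:.3f}")
print(f"RMSE on test set: {rmse:.2f}")
```

With the lab's own variables, the same two metric calls can be applied to y_test and model.predict(X_test).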

Step 5: Scatter Plot Between Calorie_Burnage and Average_Pulse

import matplotlib.pyplot as plt

import seaborn as sns

sns.scatterplot(data=df_cleaned, x='Average_Pulse', y='Calorie_Burnage')

plt.xlabel("Average Pulse")

plt.ylabel("Calorie Burnage")

plt.title("Scatter Plot of Calorie Burnage vs Average Pulse")

plt.show()
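A scatter plot shows the trend visually; the Pearson correlation coefficient is one way to put a number on it. The sketch below computes it with Series.corr from pandas on synthetic data. The DataFrame demo and its values are assumptions for illustration, not the contents of data.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_cleaned (assumed values, for illustration only)
rng = np.random.default_rng(0)
pulse = rng.uniform(80, 160, size=100)
demo = pd.DataFrame({
    "Average_Pulse": pulse,
    "Calorie_Burnage": 3.0 * pulse + 50.0 + rng.normal(0, 10, size=100),
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend
r = demo["Average_Pulse"].corr(demo["Calorie_Burnage"])
print(f"Pearson r: {r:.3f}")
```

A value close to +1 or -1 suggests that a straight line is a reasonable model for the two columns, while a value near 0 suggests that it is not.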

Step 6: Train an OLS Model Using Average_Pulse and Duration

import statsmodels.api as sm

X_ols = df_cleaned[['Average_Pulse', 'Duration']]

X_ols = sm.add_constant(X_ols) # Adds intercept term

y_ols = df_cleaned['Calorie_Burnage']

ols_model = sm.OLS(y_ols, X_ols).fit()

print(ols_model.summary())

Conclusion:

In this lab, we applied pandas to clean and preprocess data, trained a linear regression model to predict Calorie_Burnage using Average_Pulse, and visualized the relationship using a scatter plot. We also used an OLS model to evaluate the impact of both Average_Pulse and Duration on Calorie_Burnage.
