The objective of this lab is to apply data analysis techniques with pandas and to train a linear regression model that predicts players' calorie burnage from their recorded training data.
Python 3.x
Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, statsmodels
Dataset of player parameters in data.csv
Load the dataset into pandas
Clean the data of errors and missing entries
Convert columns to appropriate data types
Train a linear regression model
Visualize the linear relationship with a scatter plot
Fit an OLS model with statsmodels
The dataset, data.csv, contains health records of players in training. It has the following columns:
Player_ID: Unique identifier for each player.
Duration: Training session duration in minutes.
Average_Pulse: Average heart rate during training.
Max_Pulse: Maximum heart rate during training.
Calorie_Burnage: Calories burned during training.
The first row contains headers, and values are separated by commas.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
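Before cleaning, it can help to confirm that the file has the expected columns and to count missing values per column. The following is a minimal sketch under stated assumptions: the two-row CSV text is a made-up stand-in for data.csv, and the names sample_csv, sample, and expected_columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for data.csv (values are made up)
sample_csv = io.StringIO(
    "Player_ID,Duration,Average_Pulse,Max_Pulse,Calorie_Burnage\n"
    "1,60,110,130,409.1\n"
    "2,45,117,145,406.0\n"
)
sample = pd.read_csv(sample_csv)

# Columns listed in the dataset description
expected_columns = [
    "Player_ID", "Duration", "Average_Pulse", "Max_Pulse", "Calorie_Burnage",
]

# Report any described column that is absent from the loaded frame
missing = [col for col in expected_columns if col not in sample.columns]
print("Missing columns:", missing)

# Row/column counts and the number of missing values per column
print(sample.shape)
print(sample.isna().sum())
```

Running the same checks on the real df would show, before any rows are dropped, which columns need cleaning.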
Rows with missing values, or with non-numeric entries in Average_Pulse or Calorie_Burnage, are removed so that they cannot distort the fitted model.
# Drop rows with missing values
df_cleaned = df.dropna()
# Drop rows where 'Calorie_Burnage' or 'Average_Pulse' are non-numeric
df_cleaned = df_cleaned[pd.to_numeric(df_cleaned['Average_Pulse'], errors='coerce').notna()]
df_cleaned = df_cleaned[pd.to_numeric(df_cleaned['Calorie_Burnage'], errors='coerce').notna()]
df_cleaned.info()  # info() prints its report directly and returns None, so no print() is needed
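# Copy so the assignments below act on an independent frame (avoids SettingWithCopyWarning)
df_cleaned = df_cleaned.copy()
# Convert both model columns; either one stays object dtype if the raw file held text values
df_cleaned['Average_Pulse'] = df_cleaned['Average_Pulse'].astype('float64')
df_cleaned['Calorie_Burnage'] = df_cleaned['Calorie_Burnage'].astype('float64')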
print(df_cleaned.dtypes)
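The cleaning step relies on pd.to_numeric with errors='coerce', which turns any value that cannot be parsed as a number into NaN so that it can then be dropped. The sketch below illustrates that behaviour on a small made-up frame; the names raw, numeric_cols, converted, and clean are hypothetical and the values are not from data.csv.

```python
import pandas as pd

# Made-up raw values: a mix of valid numbers, text, and a missing entry
raw = pd.DataFrame({
    "Average_Pulse": ["110", "abc", "117", None],
    "Calorie_Burnage": ["409.1", "300.0", "n/a", "250.0"],
})

numeric_cols = ["Average_Pulse", "Calorie_Burnage"]

# errors="coerce" replaces unparseable values with NaN instead of raising
converted = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Keep only rows where every column parsed successfully
clean = converted.dropna().copy()

print(converted)
print(clean.dtypes)
```

Converting and dropping in one pass like this yields float columns directly, which is one possible alternative to filtering first and casting afterwards.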
We train a linear regression model to predict Calorie_Burnage from Average_Pulse.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df_cleaned[['Average_Pulse']]
y = df_cleaned['Calorie_Burnage']
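# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)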
model = LinearRegression()
model.fit(X_train, y_train)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
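The split above sets aside a test set, which can be used to check how well the fitted model generalizes. The following is a minimal sketch of such an evaluation, assuming synthetic data in place of data.csv: the demo frame, its slope of 3.0, and its noise level are made up for illustration, and the demo_ names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up, not taken from data.csv)
rng = np.random.default_rng(0)
pulse = rng.uniform(90, 160, size=200)
calories = 3.0 * pulse + 20.0 + rng.normal(0, 10, size=200)
demo = pd.DataFrame({"Average_Pulse": pulse, "Calorie_Burnage": calories})

demo_X = demo[["Average_Pulse"]]
demo_y = demo["Calorie_Burnage"]
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(demo_X_train, demo_y_train)

# Score the model on rows it did not see during fitting
demo_pred = demo_model.predict(demo_X_test)
mse = mean_squared_error(demo_y_test, demo_pred)
r2 = r2_score(demo_y_test, demo_pred)

print("Test MSE:", mse)
print("Test R^2:", r2)
```

Applied to the lab's own X_test and y_test, the same two metrics would indicate how much of the variation in Calorie_Burnage the single feature Average_Pulse explains.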
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(data=df_cleaned, x='Average_Pulse', y='Calorie_Burnage')
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.title("Scatter Plot of Calorie Burnage vs Average Pulse")
plt.show()
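Overlaying the fitted line on the scatter plot makes it easier to judge whether a straight line is a reasonable description of the data. The sketch below is one way to do that, assuming synthetic data in place of data.csv; it uses plain matplotlib rather than seaborn, and the names pulse, calories, line_model, grid, and fitted are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (made up for illustration, not taken from data.csv)
rng = np.random.default_rng(0)
pulse = rng.uniform(90, 160, size=100)
calories = 3.0 * pulse + 20.0 + rng.normal(0, 10, size=100)

# Fit a one-feature linear model; scikit-learn expects a 2-D feature array
line_model = LinearRegression().fit(pulse.reshape(-1, 1), calories)

# Predict along an evenly spaced grid so the fitted line is drawn left to right
grid = np.linspace(pulse.min(), pulse.max(), 50)
fitted = line_model.predict(grid.reshape(-1, 1))

fig, ax = plt.subplots()
ax.scatter(pulse, calories, alpha=0.6, label="Observed sessions")
ax.plot(grid, fitted, color="red", label="Fitted line")
ax.set_xlabel("Average Pulse")
ax.set_ylabel("Calorie Burnage")
ax.set_title("Calorie Burnage vs Average Pulse with fitted line")
ax.legend()
plt.show()
```

Points scattered evenly around the line suggest a linear model is adequate; a curved pattern would suggest it is not.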
import statsmodels.api as sm
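# Duration was not validated during cleaning, so this assumes it is already numeric
X_ols = df_cleaned[['Average_Pulse', 'Duration']]
X_ols = sm.add_constant(X_ols)  # Adds a 'const' column so the model estimates an intercept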
y_ols = df_cleaned['Calorie_Burnage']
ols_model = sm.OLS(y_ols, X_ols).fit()
print(ols_model.summary())
In this lab, we applied pandas to clean and preprocess data, trained a linear regression model to predict Calorie_Burnage using Average_Pulse, and visualized the relationship using a scatter plot. We also used an OLS model to evaluate the impact of both Average_Pulse and Duration on Calorie_Burnage.