A GUIDE - Prediction Using the XGB Regressor in Python 3

This is a complete guide on how to use the XGB Regressor and why to use it.

PROBLEM STATEMENT

Estimating customer value and extrapolating that value into the future are problems every business wants to solve. The current problem comes from the domain of online skill-based gaming.

ABOUT THE SITUATION

The provided feature space contains 22 variables spanning multiple aspects of customer behaviour. Temporal variation is captured by a numbered sequence of entries for each customer. The goal is to predict Y1 and Y2, which represent customer value and its temporal extrapolation respectively. Each customer can have N sequenced entries as input, and only one pair (Y1, Y2) is expected as output; one way to reshape such data is sketched below.
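Since each customer contributes N sequenced rows but only one (Y1, Y2) pair is expected, the rows eventually need to be collapsed to one per customer. Below is a minimal, purely illustrative sketch of one way to do that with a pandas groupby, assuming the identifier column UNIQUE_IDENTIFIER and a few of the feature columns that appear later in this guide (the actual workflow below simply drops duplicate UNIQUE_IDENTIFIER rows).

import pandas as pd

# Illustrative only: collapse each customer's N sequenced rows into a single row
# by aggregating some numeric features (sum/mean/max are common choices).
df = pd.read_csv("../input/beyond-analysis/train.csv")
per_customer = (
    df.groupby("UNIQUE_IDENTIFIER")
      .agg({"REVENUE": "sum", "DEPOSIT": "sum", "SEQUENCE_NO": "max"})
      .reset_index()
)
print(per_customer.head())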

The desired output has one row per customer, containing the UNIQUE_IDENTIFIER along with the predicted Y1 and Y2 (the same format as the submission file built at the end of this guide).

Let’s get started

Note: the data has been taken from Kaggle.

First, let's import the data in Python.

# This Python 3 environment comes with many helpful analytics libraries installed

# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory

# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"

# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

We have imported all the important libraries; let's check the dataset.

df = pd.read_csv("../input/beyond-analysis/train.csv")
df

The data has been loaded. Now it's time to explore the data to gather some insights; this process is known as Exploratory Data Analysis (EDA).

Exploratory Data Analysis

Why EDA?

  • helps in looking at the data before making any assumptions

  • helps identify obvious errors

  • detects outliers or anomalous events

  • finds interesting relations among the variables

df.shape

df.isnull().sum()

df.describe()

df.nunique()

It's time to find out which columns have a strong impact on the target values. For this, we'll compute the correlations.

correlations = df.corr()
print(correlations["Y1"])
print(correlations["Y2"])

Columns with a negative correlation will be removed, and to improve accuracy, columns whose correlation is below 0.2 will also be removed (a sketch of this filtering follows).
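Here is a minimal sketch of that filtering, assuming the targets are named Y1 and Y2 and keeping only the features whose correlation with at least one target is 0.2 or higher (which also drops negatively correlated columns):

correlations = df.corr()
keep_cols = [
    col for col in correlations.columns
    if col not in ("Y1", "Y2")
    and max(correlations.loc[col, "Y1"], correlations.loc[col, "Y2"]) >= 0.2
]
df_filtered = df[keep_cols + ["Y1", "Y2"]]
print(keep_cols)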

Now it's time to check for outliers.

The first and foremost question after hearing the word outliers is: what is an outlier?

In layman's terms, outliers are values that lie far away from the rest of the data.

For example:

There are 10 students in a class who have taken a maths test.

But three students were absent, so they get zero out of 100, while the other students get marks like:

80

50

60

90

100

50

65

So the average of the 10 students would be (student 1 + student 2 + ... + student 10) / 10, but this won't be an accurate average, because the 3 students who didn't take the exam and scored zero drag down the actual average. So we remove the marks of these 3 students and calculate the average as (student 1 + student 2 + ... + student 7) / 7. This average will be more accurate than the previous one, as the quick numeric check below shows.
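A quick numeric check of this example:

marks_all = [0, 0, 0, 80, 50, 60, 90, 100, 50, 65]   # including the three absentees
marks_present = [80, 50, 60, 90, 100, 50, 65]        # absentees removed

print(sum(marks_all) / len(marks_all))          # 49.5, dragged down by the zeros
print(sum(marks_present) / len(marks_present))  # about 70.7, a fairer average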

Hence, that’s the importance of removing outliers.

But first we need to check which columns have outliers. For this we'll use the matplotlib.pyplot and seaborn libraries to draw box plots.

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x="STATUS_CHECK", y="Y1", data=df)

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x="STATUS_CHECK", y="Y2", data=df)

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=df["ENTRY"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=df["REVENUE"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=df["WINNINGS_1"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=df["DEPOSIT"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=df["SEQUENCE_NO"])

Now we have identified the important columns that have a direct impact on the target values, and the outliers in them.

It's time to remove the outliers.

To remove the outliers, we'll use quantile-based flooring and capping, and the Inter-Quartile Range (IQR) method.

low = df["Y1"].quantile(0.10)   # floor: 10th percentile
up = df["Y1"].quantile(0.90)    # cap: 90th percentile
print(low)
print(up)
print(df["Y1"].skew())

df1 = df.copy()
df1["Y1"] = np.where(df1["Y1"] < low, low, df1["Y1"])  # floor values below the 10th percentile
df1["Y1"] = np.where(df1["Y1"] > up, up, df1["Y1"])    # cap values above the 90th percentile
print(df1["Y1"].skew())

low_1 = df["Y2"].quantile(0.10)
up_1 = df["Y2"].quantile(0.90)
print(low_1)
print(up_1)
print(df["Y2"].skew())

df1["Y2"] = np.where(df1["Y2"] < low_1, low_1, df1["Y2"])
df1["Y2"] = np.where(df1["Y2"] > up_1, up_1, df1["Y2"])
print(df1["Y2"].skew())

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)  # calculating the Inter-Quartile Range

BP = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
BP.shape  # Q1 - 1.5 * IQR removes outliers below the lower bound, and Q3 + 1.5 * IQR removes those above the upper bound

Now that some of the outliers have been removed, it's time to check again using box plots.

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x="STATUS_CHECK", y="Y1", data=BP)

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x="STATUS_CHECK", y="Y2", data=BP)

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=BP["ENTRY"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=BP["REVENUE"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=BP["WINNINGS_1"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=BP["DEPOSIT"])

fig = plt.figure(figsize=(10, 10))
ax = sns.boxplot(x=BP["SEQUENCE_NO"])

Now, it can be observed that there are only a few outliers left. Hence, we are good to go.

After removing outliers, it is important to understand whether there is a pattern between the target values, so that prediction can be done more easily.

Hence, for this we’ll do Relationship Analysis

sns.set(style="darkgrid")
fig = sns.kdeplot(BP["Y1"], shade=True, color="r")
fig = sns.kdeplot(BP["Y2"], shade=True, color="b")
plt.show()
# The KDE plot compares the distributions of the two target variables, Y1 and Y2.
# It shows the same pattern observed in the previous plots.

targets = BP[["Y1", "Y2"]]
targets.plot(subplots=True)

BP[["Y1", "Y2"]].hist()

BP[["Y1"]].plot(kind="density")
BP[["Y2"]].plot(kind="density")

pd.plotting.lag_plot(BP["Y1"], lag=1)
pd.plotting.lag_plot(BP["Y2"], lag=1)

# Lag = 10
pd.plotting.lag_plot(BP["Y1"], lag=10)

Now let's run the MODEL and check its accuracy using RMSE.

train = pd.read_csv("trainfold_5.csv")
test = pd.read_csv("../input/beyond-analysis/test.csv")

train.drop_duplicates(subset=["UNIQUE_IDENTIFIER"], keep="first", inplace=True)
test.drop_duplicates(subset=["UNIQUE_IDENTIFIER"], keep="first", inplace=True)

X = train.drop(["Y1", "Y2", "CATEGORY_1", "CATEGORY_2", "kfold"], axis=1)
test = test.drop(["CATEGORY_1", "CATEGORY_2"], axis=1)

test_id = test.UNIQUE_IDENTIFIER
test

y = train[["Y1", "Y2"]]
y

from sklearn import model_selection

xtrain, xvalid, ytrain, yvalid = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

params = {"learning_rate": 0.04122630533921803, "max_depth": 8}  # defined here but not used directly below

xgb_params = {
    "learning_rate": 0.1325,
    "subsample": 0.7875490025178,
    "colsample_bytree": 0.11807135201147,
    "max_depth": 2,
    "reg_lambda": 0.0008746338866473539,
    "reg_alpha": 23.13181079976304,
}
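The training code that follows uses LightGBM, but since this guide is about the XGB regressor, here is a minimal sketch of how the xgb_params dictionary above could be plugged into an XGBRegressor, one model per target, mirroring the separate Y1/Y2 models trained below (the n_estimators and random_state values here are assumptions, not values from the original notebook):

from xgboost import XGBRegressor

# One model per target, mirroring the separate Y1/Y2 models trained below.
xgb_y1 = XGBRegressor(**xgb_params, n_estimators=500, random_state=42)
xgb_y1.fit(xtrain, ytrain["Y1"], eval_set=[(xvalid, yvalid["Y1"])], verbose=False)

xgb_y2 = XGBRegressor(**xgb_params, n_estimators=500, random_state=42)
xgb_y2.fit(xtrain, ytrain["Y2"], eval_set=[(xvalid, yvalid["Y2"])], verbose=False)

preds_y1 = xgb_y1.predict(xvalid)
preds_y2 = xgb_y2.predict(xvalid)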


import lightgbm as lgb

params_lgb = {
    "task": "train",
    "boosting_type": "gbdt",
    "objective": "regression",
    "subsample": 0.95312,
    "metric": "rmse",
    "learning_rate": 0.04135,
    "max_depth": 2,
    "feature_fraction": 0.2256038826485174,
    "bagging_fraction": 0.7705303688019942,
    "min_child_samples": 290,
    "reg_alpha": 14.68267919457715,
    "reg_lambda": 66.156,
    "max_bin": 772,
    "min_data_per_group": 177,
    "bagging_freq": 1,
    "cat_smooth": 96,
    "cat_l2": 17,
    "verbosity": -1,
    "random_state": 42,
    "n_estimators": 10000,
    "colsample_bytree": 0.1107,
}

# Model for Y1
lgb_train = lgb.Dataset(xtrain, ytrain["Y1"])
lgb_val = lgb.Dataset(xvalid, yvalid["Y1"])

model = lgb.train(params=params_lgb,
                  train_set=lgb_train,
                  valid_sets=lgb_val,
                  early_stopping_rounds=100,
                  verbose_eval=1000)

preds_valid = model.predict(xvalid,num_iteration=model.best_iteration)

test_pre = model.predict(test,num_iteration=model.best_iteration)

#preds_valid = model.predict(xvalid)

# Apply the trained model to the test data and predict the output
# test_pre = model.predict(test)  # redundant: already predicted above using best_iteration

test_pre/=5

# The same params_lgb dictionary defined above is reused for the Y2 model.

# Model for Y2
lgb_train_2 = lgb.Dataset(xtrain, ytrain["Y2"])
lgb_val_2 = lgb.Dataset(xvalid, yvalid["Y2"])

model2 = lgb.train(params=params_lgb,
                   train_set=lgb_train_2,
                   valid_sets=lgb_val_2,
                   early_stopping_rounds=100,
                   verbose_eval=1000)

preds_valid_2 = model2.predict(xvalid, num_iteration=model2.best_iteration)
test_pre0 = model2.predict(test, num_iteration=model2.best_iteration)

# model2.fit(xtrain, ytrain["Y2"], early_stopping_rounds=100, eval_set=[(xvalid, yvalid["Y2"])], verbose=1000)
# preds_valid_2 = model2.predict(xvalid)
# Apply the trained model to the test data and predict the output
# test_pre0 = model2.predict(test)

test_pre0 /= 5

yvalid

Hence, we have trained the models and obtained the predictions. Now it's time to check how accurate these predictions are; for this, we'll use RMSE.

  • Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

  • Residuals are a measure of how far data points are from the regression line.

  • RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit (a minimal sketch of computing it from this definition follows below).
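For reference, here is a minimal sketch of computing RMSE directly from that definition, i.e. the square root of the mean of the squared residuals:

import numpy as np

def rmse(y_true, y_pred):
    residuals = np.asarray(y_true) - np.asarray(y_pred)  # prediction errors
    return np.sqrt(np.mean(residuals ** 2))              # root of the mean squared error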

from math import sqrt
from sklearn.metrics import mean_squared_error

y1, y2 = yvalid["Y1"], yvalid["Y2"]
print(f"Y1 rmse: {sqrt(mean_squared_error(y1, model.predict(xvalid)))}")
print(f"Y2 rmse: {sqrt(mean_squared_error(y2, model2.predict(xvalid)))}")

preds1 = test_pre
preds2 = test_pre0

submission = pd.DataFrame()
submission["UNIQUE_IDENTIFIER"] = test_id
submission["Y1"] = preds1
submission["Y2"] = preds2
submission.to_csv("Submission-15.csv", index=False)
submission.head()

It can be observed that the RMSE is low enough to say that our predictions are reasonably accurate.

MODEL SELECTION

A supervised learning algorithm was used because the target variables Y1 and Y2, which need to be predicted, are a function of the other variables in the dataset.

Since the output variables are real values, we used regression algorithms for our model.

THE MODEL

It was observed that the XGB regressor outperformed the other models in terms of RMSE score. After that, the data was split into training and test datasets.

Then the model was tuned by changing the learning rate and checking for which value the test RMSE deviates least from the training RMSE over 500 iterations, as sketched below.
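Here is a minimal sketch of what such a tuning loop could look like (the candidate learning rates, variable names, and the use of scikit-learn's mean_squared_error are illustrative assumptions, not the exact code from the original experiment):

from math import sqrt
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

for lr in [0.01, 0.05, 0.1, 0.2]:  # illustrative candidate learning rates
    reg = XGBRegressor(learning_rate=lr, n_estimators=500, random_state=42)
    reg.fit(xtrain, ytrain["Y1"])
    rmse_train = sqrt(mean_squared_error(ytrain["Y1"], reg.predict(xtrain)))
    rmse_valid = sqrt(mean_squared_error(yvalid["Y1"], reg.predict(xvalid)))
    # Prefer the learning rate where the validation RMSE deviates least from the training RMSE.
    print(lr, rmse_train, rmse_valid, abs(rmse_valid - rmse_train))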

NATURE OF PREDICTIONS

  • For each of Y1 and Y2, a separate model was built, giving a distinct set of predictions for each target variable.

  • Also, the correlation table calculated between the features and the targets was used for feature engineering to improve the predictions.

CHALLENGES FACED

  • For Y1, the RMSE scores for the train and test sets were not as close as expected in almost all cases, compared to Y2.

  • Also, for Y1, the RMSE score saturated at about 200 iterations and didn't improve much beyond that.

LEARNINGS

Why the XGB Regressor outperformed other models:

i) Parallel creation of trees.

ii) Tree pruning using a depth-first approach.

iii) Out-of-core computing.

iv) Regularization to avoid overfitting.

v) Handling of missing values using a sparsity-aware approach.

vi) Inbuilt cross-validation (a minimal sketch of this is shown below).
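As a minimal sketch of point (vi), XGBoost's built-in cross-validation can be run like this (the parameter values are illustrative, and xtrain/ytrain are the splits created earlier):

import xgboost as xgb

dtrain = xgb.DMatrix(xtrain, label=ytrain["Y1"])
cv_results = xgb.cv(
    params={"objective": "reg:squarederror", "learning_rate": 0.1, "max_depth": 2},
    dtrain=dtrain,
    num_boost_round=500,
    nfold=5,
    metrics="rmse",
    early_stopping_rounds=100,
    seed=42,
)
print(cv_results["test-rmse-mean"].min())  # best mean validation RMSE across folds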

Contacts

In case you have any questions or any suggestions on what my next article should be about, please leave a comment below or mail me at aryanbajaj104@gmail.com.

If you want to keep updated with my latest articles and projects, keep visiting the website ^_^.

Connect with me via:

LinkedIn