Student Performance Prediction

Can you accurately predict students' performance based on previous semesters' data?

Introduction

In present-day educational systems, student performance is a growing concern. Predicting student performance in advance can help students and their teachers keep track of a student's progress. Many institutes have adopted continuous evaluation systems, which help students improve their performance. The purpose of a continuous evaluation system is to support students who study regularly.

In this project, we use a UCI Machine Learning Repository dataset of students to predict their course-wise performance, build a model using a deep neural network (DNN), and deploy that model in a GUI using Flask.

Prerequisites

  • Python 3.+

  • Understanding of the libraries used (Keras, scikit-learn, NumPy, pandas, Flask, Matplotlib, Seaborn)

  • Google Colab

  • Basic understanding of machine learning classification methods and algorithms.

Problem Description

The challenge of this project is to predict students' performance in particular subjects based on data from previous semesters. We have data for two different subjects, Mathematics and Portuguese language, along with demographic and school-related attributes, from which we have to predict the current-semester performance of students in each subject using a deep neural network model.

Data Set Information

This data describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper for more details).

Process of predicting student performance

Import DataSet:

  • To import a CSV dataset, you can use the pandas function pd.read_csv(). Its basic argument is the path of the data file.

Example:

import pandas as pd

path = "URL"  # path or URL of the CSV file
data = pd.read_csv(path)

  • This dataset contains many string values, continuous values and also some null values, so first of all we have to clean the data (a quick inspection sketch is shown below).
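A quick inspection sketch before cleaning (dropping rows with missing values is just one possible cleaning choice, not necessarily the one used in the original project):

data.info()                 # column types: string vs. numeric
print(data.isnull().sum())  # count of null values per column
data = data.dropna()        # one possible cleaning step: drop rows with missing values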


Convert G1, G2 and G3 columns into 5 different Classes:

  • The student grades are in the range 0 to 20, so we classify them into 5 different groups based on the grade. We start with G3; a sketch for G1 and G2 follows the example.

  • The .loc[] indexer is used to access a group of rows and columns by label(s) or a boolean array.

  • .loc[] is primarily label based, but may also be used with a boolean array.

Example:

data.loc[(data.G3 >= 18) & (data.G3 <= 20), 'FinalGrade'] = 'Excellent'

data.loc[(data.G3 >= 15) & (data.G3 <= 17), 'FinalGrade'] = 'Good'

data.loc[(data.G3 >= 11) & (data.G3 <= 14), 'FinalGrade'] = 'Satisfactory'

data.loc[(data.G3 >= 6) & (data.G3 <= 10), 'FinalGrade'] = 'Poor'

data.loc[(data.G3 >= 0) & (data.G3 <= 5), 'FinalGrade'] = 'Failure'
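The same binning can be applied to G1 and G2, as the heading above suggests; a sketch using a small helper function (the new column names Grade1 and Grade2 are assumptions):

def grade_to_class(df, col, new_col):
    # Map a 0-20 grade column to the five performance classes
    df.loc[(df[col] >= 18) & (df[col] <= 20), new_col] = 'Excellent'
    df.loc[(df[col] >= 15) & (df[col] <= 17), new_col] = 'Good'
    df.loc[(df[col] >= 11) & (df[col] <= 14), new_col] = 'Satisfactory'
    df.loc[(df[col] >= 6) & (df[col] <= 10), new_col] = 'Poor'
    df.loc[(df[col] >= 0) & (df[col] <= 5), new_col] = 'Failure'

grade_to_class(data, 'G1', 'Grade1')
grade_to_class(data, 'G2', 'Grade2')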


Encode the necessary columns:

  • In the dataset we have many columns whose values are strings but represent categories.

  • So we have to convert that string data into integer data by mapping each distinct string value to a consistent integer. For this process we use LabelEncoder and the get_dummies() function.

Label Encoder:

  • Label encoding in Python can be achieved using the scikit-learn library. Scikit-learn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels.

Example:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data.Mjob = le.fit_transform(data.Mjob)  # encode the mother's-job categories as integers

get_dummies() function:

  • The get_dummies() function is used to convert categorical variable into dummy/indicator variables.

Example:

School = pd.get_dummies(data['school'], prefix='school', drop_first=True)  # one-hot encode, dropping the first level

  • After the above-mentioned steps, our dataset is ready to be applied to a model.

  • Now we first have to divide our dataset into training and testing data, so we can use the training data to check the training accuracy of our model and use the testing data to check how accurately the model predicts unseen data.


Split the data into features and target variables:

  • Now we separate our data into features and target variables. The feature columns are stored in variable 'X' and the target variable is stored in variable 'y' (see the sketch below).
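A minimal sketch, assuming the encoded dataset is stored in data and FinalGrade is the target column created earlier (exactly which columns are dropped from the features is an assumption):

X = data.drop(columns=['FinalGrade', 'G3'])  # feature columns (dropped columns are an assumption)
y = data['FinalGrade']                       # target variable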

Feature selection methods:

1. Univariate Selection

  • Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of data.

  • The example below uses the chi-squared (chi²) statistical test for non-negative features to select the 23 best features from the student performance dataset.

Example:

from sklearn.feature_selection import SelectKBest, chi2

bestfeatures = SelectKBest(score_func=chi2, k=23)
fit = bestfeatures.fit(X, y)
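To see which features scored highest, the chi-squared scores can be paired with the column names (a sketch; this particular DataFrame layout is an assumption, not taken from the original notebook):

import pandas as pd

scores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})
print(scores.nlargest(23, 'Score'))  # the 23 highest-scoring features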


2. Feature Importance

  • You can get the feature importance of each feature of your dataset by using the feature importance property of the model.

  • Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.

  • Feature importance is built into tree-based classifiers; we will use an Extra Trees classifier to extract the top 10 features of the dataset.

Example:

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)  # importance score for each feature
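To view the top 10 features mentioned above, the importances can be plotted as a bar chart (a sketch using pandas and Matplotlib):

import pandas as pd
import matplotlib.pyplot as plt

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')  # 10 most important features
plt.show()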


3. Correlation Matrix

  • Correlation states how the features are related to each other or the target variable.

  • Correlation can be positive (an increase in one feature's value tends to increase the value of the target variable) or negative (an increase in one feature's value tends to decrease the value of the target variable).

  • A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.

Example:

corrmat = data.corr()

top_corr_features = corrmat.index
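A sketch of the heatmap described above (the figure size and colour map are arbitrary choices):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 20))
sns.heatmap(data[top_corr_features].corr(), annot=True, cmap='RdYlGn')
plt.show()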


Remove unnecessary features:

  • From the above feature selection methods, we observed that some features decrease the accuracy, so we drop those features here (a sketch is shown below).
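A minimal sketch of dropping such features (the column names here are placeholders, not the actual columns identified in the project):

data = data.drop(columns=['feature_a', 'feature_b'])  # placeholder names for the dropped features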


Divide Data in training and testing set:

  • To divide the data into training and testing sets we use the K-Fold cross-validation method.

  • K-Fold is useful when we have a small dataset. In K-Fold, the training set is split into k smaller sets. The following procedure is followed for each of the k "folds":

        1. A model is trained using k−1 of the folds as training data;

        2. The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

  • This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set).

  • We can import the KFold class from sklearn.model_selection. It provides train/test indices that split the data into training and test sets.

Example:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
kf.split(X)  # yields train/test index pairs, one pair per fold
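In practice, the indices returned by kf.split(X) are used in a loop so that a model is trained and evaluated once per fold (a sketch, assuming X and y are pandas objects):

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # fit and evaluate the model on this fold (see the fit/evaluate steps below)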


Create DNN model:

  • We use keras to develop a DNN model. Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models.

  • It wraps the efficient numerical computation libraries ‘Theano’ and ‘TensorFlow’ and allows you to define and train neural network models in just a few lines of code.

  • Here we have used a Sequential model with a fully-connected network structure of six layers.

  • We have used the rectified linear unit activation function, referred to as ReLU, on the first five layers and the softmax function in the output layer. We use softmax instead of sigmoid in the output layer because we have more than two output classes.

Example:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(50, input_dim=23, activation='relu'))
model.add(Dense(35, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(14, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(5, activation='softmax'))

  • The model expects rows of data with 23 variables. The first hidden layer has 50 nodes and uses the relu activation function, and the remaining hidden layers are defined the same way. The output layer has 5 nodes and uses the softmax activation function.


Compilation of model:

  • After defining model, we have to compile it.

  • Here we have used cross entropy as the loss argument. This loss is for a multiclass classification problem and is defined in keras as “categorical_crossentropy”.

  • We will define the optimizer as the efficient stochastic gradient descent algorithm “adam“. This is a popular version of gradient descent because it automatically tunes itself and gives good results in a wide range of problems.

  • Finally, because it is a classification problem, we will collect and report the classification accuracy, defined via the metrics argument.

Example:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
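Note that the categorical_crossentropy loss expects one-hot encoded targets. A minimal sketch of preparing y for this loss, assuming y holds the five FinalGrade labels (this particular combination of LabelEncoder and to_categorical is an assumption, not taken verbatim from the original notebook):

from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

y_int = LabelEncoder().fit_transform(y)           # class labels -> integers 0..4
y_onehot = to_categorical(y_int, num_classes=5)   # integers -> one-hot vectors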


Make a ModelCheckpoint file to store the model weights whenever accuracy improves:

  • When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is, or used as the basis for ongoing training.

  • A simpler check-point strategy is to save the model weights to the same file, if and only if the validation accuracy improves.

  • In the code below, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.

Example:

from keras.callbacks import ModelCheckpoint

filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]  # list of callbacks passed to fit() below

Fit the model:

  • We can train or fit our model on our loaded data by calling the fit() function on the model.

  • Training occurs over epochs and each epoch is split into batches.

    1. Epoch: One pass through all of the rows in the training dataset.

    2. Batch: One or more samples considered by the model within an epoch before weights are updated.

  • One epoch comprises one or more batches, based on the chosen batch size, and the model is fit for many epochs.

  • The training process will run for a fixed number of iterations through the dataset called epochs, that we must specify using the epochs argument. We must also set the number of dataset rows that are considered before the model weights are updated within each epoch, called the batch size and set using the batch_size argument.

  • We use the K-Fold cross-validation method, so we repeat this process k times, once per fold.

Example:

history = model.fit(X_train, y_train, epochs=800, batch_size=32,
                    validation_data=(X_test, y_test),
                    callbacks=callbacks_list, verbose=0)


Evaluate the model:

  • We have trained our neural network, and we can now evaluate its performance on both the training data and the held-out test data.

  • We can evaluate our model on our training dataset using the evaluate() function on our model and pass it the same input and output used to train the model.

  • This will generate a prediction for each input and output pair and collect scores, including the average loss and any metrics you have configured, such as accuracy.

  • The evaluate() function will return a list with two values. The first will be the loss of the model on the dataset and the second will be the accuracy of the model on the dataset.

Example:

train_loss, train_acc=model.evaluate(X_train,y_train, verbose=0)

test_loss, test_acc=model.evaluate(X_test,y_test,verbose=0)



Store DNN model in pickle file:

  • We can store our compiled and fitted model in a pickle file using the pickle.dump() method.

  • The Python pickle module is used for serializing and de-serializing Python object structures. The process of converting any kind of Python object (list, dict, etc.) into a byte stream is called pickling, serialization, flattening or marshalling. We can convert the byte stream (generated through pickling) back into Python objects by a process called unpickling.

Example:

import pickle

filename = 'KFold_final_classification_model.pkl'
pickle.dump(model, open(filename, 'wb'))


Load the pickle file:

  • We can load our previously stored pickle file using pickle.load() method and evaluate the model performance.

Example:

load_model=pickle.load(open(filename,'rb'))

result=load_model.evaluate(X_test,y_test)


Download the pickle file:

  • We can download the previously stored pickle file from the Colab runtime to the local machine.

Example:

from google.colab import files

files.download('KFold_final_classification_model.pkl')


Deploy the model into a GUI using Flask:

  • After generating the pickle file of the student performance prediction model, we use Flask, a lightweight web framework, to create a web-based GUI for predicting student performance.

  • In Google Colab we cannot access localhost, so to run our code we use the Python library flask-ngrok.

  • To install flask-ngrok in Google Colab:

!pip install flask-ngrok

  • Now we have to create the GUI for our website using HTML and CSS. Once the user fills in all the required information, it is sent to the app.py file using the POST method, which predicts the performance of the student based on the given data.

  • Now we have to create our app.py file.

  • In the app.py file we receive the data sent from the .html file. First, import all the libraries and methods that we are going to use.

import numpy as np

from flask import Flask, request, jsonify, render_template

import pickle

from flask_ngrok import run_with_ngrok

  • Here we have to load our model using the pickle.load() method.

app = Flask(__name__, template_folder='templates')

model = pickle.load(open('modelName.pkl', 'rb'))

run_with_ngrok(app)

  • Now we define a predict() method, which receives the data, converts it into a NumPy array, and passes this array to our model; the model predicts the performance of the student and the result is stored in the variable output.

performance = {
    0: "Excellent",
    1: "Failure",
    2: "Good",
    3: "Poor",
    4: "Satisfactory"
}

  • Now the output variable is an integer, so we convert this integer into the appropriate performance class using the dictionary named 'performance'.

return render_template('index.html', prediction_text='Student Grade will be {}.'.format(performance[output]))
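Putting these pieces together, a minimal sketch of the predict() route (the form-field handling and the argmax step are illustrative assumptions; the actual app.py may differ):

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Collect the values posted from the HTML form (field names and order are assumptions)
    features = [float(x) for x in request.form.values()]
    final_features = np.array([features])            # shape (1, n_features) expected by the model
    prediction = model.predict(final_features)       # probabilities for the 5 classes
    output = int(np.argmax(prediction, axis=1)[0])   # index of the most likely class
    return render_template('index.html',
                           prediction_text='Student Grade will be {}.'.format(performance[output]))

if __name__ == '__main__':
    app.run()  # with flask-ngrok, this also opens a public ngrok tunnel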

  • So now our prediction of student performance will be displayed on the web page.

Conclusion:

By completing this project we notice that, with a large amount of training data, high accuracy is easier to achieve with a DNN model. Here we have a small dataset, so we used K-Fold cross-validation for splitting the dataset into training and testing sets, which is why we can still achieve high accuracy with this model. A student performance prediction model is very important nowadays because, by predicting the performance of students, teachers can adapt their teaching accordingly and students can work or learn according to the predicted performance.

Co-author : Abhi Kaila

Guide : Priyanka Patel(Asst. Professor, CSPIT, CHARUSAT)


Copyright © All Rights Reserved by Dixit Kapuriya