In this activity, we use the StudentEvent dataset. The original dataset was pivoted by Student ID and event context, and then grouped by event context. The resulting dataset was standardized in RapidMiner.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
path1 = "/content/drive/My Drive/Colab Notebooks/StudentEvent.xlsx"
dataf1 = pd.read_excel(path1)
dataf1.head(3)
The pd.read_excel() function loads the Excel file into a pandas DataFrame. This is our dataset after preprocessing in Excel.
dataf1.info()
With info() we can observe each column's datatype, the total number of rows, and the number of non-null values per column. After pre-processing, none of the 35 rows has a missing value. Of the 11 columns, two have the object datatype, one is int64, and eight are float64. This is a result of combining the records that belong to the same event context group.
Since the data has been standardized in RapidMiner, we use head() to inspect the dataset values.
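The facts that info() prints can also be pulled out as values, which is handy when they need to be checked in code. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are hypothetical and are not taken from StudentEvent.xlsx.

```python
import pandas as pd

# Hypothetical stand-in data, NOT the real StudentEvent values.
toy = pd.DataFrame({
    "StudentID": ["S1", "S2", "S3"],
    "Count": [3, 5, 2],
    "Quiz": [0.5, -1.2, 0.7],
    "Forum": [1.1, None, -0.4],   # one deliberately missing value
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of columns per datatype (dtype names converted to plain strings).
dtype_counts = toy.dtypes.astype(str).value_counts()

print(missing_per_column)
print(dtype_counts)
```

A total of zero in `missing_per_column` corresponds to the "no missing values" observation made from the info() output.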
dataf1.head(3)
After pre-processing, we identified 7 input variables and one output variable, MarksBin. The input variables are:
Assignment
Forum
Activity
LectureNote
Tutorial
Questionnaire
Quiz
data1 = dataf1[['Assignment','Forum','Activity','LectureNote','Tutorial','Questionnaire','Quiz','MarksBin']]
data1
data1.describe()
The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution. Since our dataset has been standardized, the standard deviation of every input column is the same, namely 1.
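To see why a standardized column shows a standard deviation of 1 in describe(), here is a small sketch of a z-score transformation. It assumes that standardization subtracts the column mean and divides by the sample standard deviation (ddof=1, which is also what describe() reports). The exact RapidMiner settings are not stated here, and the `raw` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values, NOT the real StudentEvent counts.
raw = pd.DataFrame({
    "Quiz": [2.0, 4.0, 6.0, 8.0],
    "Forum": [10.0, 10.0, 30.0, 50.0],
})

# Assumed z-score standardization: (x - mean) / sample standard deviation.
# DataFrame.std() uses ddof=1 by default, the same convention as describe().
standardized = (raw - raw.mean()) / raw.std()

summary = standardized.describe()
print(summary.loc[["mean", "std"]])
```

If a tool divides by the population standard deviation instead, the value reported by describe() would be slightly larger than 1 for small samples.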
data1.hist(figsize=(16, 15), bins=50, xlabelsize=8, ylabelsize=8);  # the trailing semicolon suppresses the verbose matplotlib output
To create histograms, we use the pandas hist() method. Calling hist() on a DataFrame draws one histogram for each numeric column, which lets us see the distribution of every variable.
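Behind each histogram is a simple count of how many values fall into each bin. A minimal sketch with NumPy follows; the array `values` is hypothetical and only illustrates the binning step.

```python
import numpy as np

# Hypothetical standardized values, NOT taken from the dataset.
values = np.array([-1.5, -0.5, -0.4, 0.0, 0.1, 0.2, 0.9, 1.6])

# Split the range [min, max] into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=4)

print(counts)  # one count per bin
print(edges)   # bin boundaries, always one more than the number of bins
```

With bins=50 and only 35 rows, as in the call above, most bins are empty, so a smaller bin count may give a more readable picture.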
data1.plot.hist(bins=50, alpha=0.7)
data1.plot()
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. We use scatter plots to observe relationships between variables.
for i in range(0, len(data1.columns), 4):  # one row of plots per group of 4 columns
    sns.pairplot(data=data1,
                 x_vars=data1.columns[i:i+4],  # slice width matches the step, so no column repeats
                 y_vars=['MarksBin'])
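How the columns are split into rows of plots can be checked without drawing anything. The sketch below uses a hypothetical helper, `chunk_columns`, to show that columns repeat whenever the slice width is larger than the loop step.

```python
# Column names as selected for data1 in this notebook.
columns = ['Assignment', 'Forum', 'Activity', 'LectureNote',
           'Tutorial', 'Questionnaire', 'Quiz', 'MarksBin']


def chunk_columns(cols, step, width):
    """Return the column groups produced by slicing cols[i:i+width] every `step` items."""
    return [cols[i:i + width] for i in range(0, len(cols), step)]


overlapping = chunk_columns(columns, step=4, width=5)  # width > step
aligned = chunk_columns(columns, step=4, width=4)      # width == step

print(overlapping)
print(aligned)
```

With a step of 4 and a width of 5, 'Tutorial' lands in both groups; with matching step and width, each column is plotted once.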
corr() computes the pairwise correlation of all columns in the DataFrame. Null values are excluded automatically. Non-numeric columns were silently ignored by older pandas versions; since pandas 2.0 they have to be excluded explicitly, for example with numeric_only=True.
data1_corr = data1.corr()['MarksBin'][:-1] # -1 because the last column is MarksBin
golden_features_list = data1_corr[abs(data1_corr) >= 0.1].sort_values(ascending=False)
print("There are {} strongly correlated values with MarksBin:\n{}".format(len(golden_features_list), golden_features_list))
For this dataset, 2 variables have an absolute correlation of at least 0.1 with MarksBin.
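The selection by correlation threshold can be tried on a tiny example. This is a minimal sketch with made-up numbers; the DataFrame `toy` is hypothetical, and the target is dropped by label rather than by position, which is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical values, NOT the real StudentEvent data.
toy = pd.DataFrame({
    "Quiz":     [1.0, 2.0, 3.0, 4.0, 5.0],
    "Forum":    [2.0, 1.0, 2.0, 1.0, 2.0],
    "MarksBin": [1, 1, 2, 3, 3],
})

# Pearson correlation of every input column with the target.
target_corr = toy.corr()["MarksBin"].drop("MarksBin")

# Keep only columns whose absolute correlation reaches the 0.1 threshold.
selected = target_corr[target_corr.abs() >= 0.1].sort_values(ascending=False)

print(selected)
```

Dropping "MarksBin" by label keeps the code correct even if the target is not the last column.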
Correlation is a statistic that measures the degree to which two variables move in relation to each other.
corr = data1.drop('MarksBin', axis=1).corr() # We already examined MarksBin correlations
plt.figure(figsize=(12, 10))
sns.heatmap(corr[(corr >= 0.1) | (corr <= -0.1)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
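The expression corr[(corr >= 0.1) | (corr <= -0.1)] keeps the shape of the matrix and replaces every cell that fails the condition with NaN, which the heatmap then leaves blank. A small sketch with a made-up 3x3 matrix follows; the labels A, B, and C are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix, NOT computed from the dataset.
corr = pd.DataFrame(
    [[1.00, 0.05, -0.40],
     [0.05, 1.00, 0.30],
     [-0.40, 0.30, 1.00]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

# Boolean indexing on a DataFrame keeps its shape; cells failing the test become NaN.
masked = corr[(corr >= 0.1) | (corr <= -0.1)]

print(masked)
```

Only the weak A/B pair is blanked out here, while negative correlations such as A/C are kept.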