SSK4604-Data Mining

EDA After Pre-processing

Introduction To Dataset

In this activity, we used the StudentEvent dataset. The original dataset was pivoted by Student ID and event context, and records with the same event context were grouped together. The dataset was then standardized in RapidMiner.

StudentEvent Dataset

Importing Data

import pandas as pd

path1 = "/content/drive/My Drive/Colab Notebooks/StudentEvent.xlsx"

dataf1 = pd.read_excel(path1)

dataf1.head(3)

The pd.read_excel() function reads the Excel file into a pandas DataFrame. This is our dataset after pre-processing in Excel.

Observe The Data Information

dataf1.info()

Using info(), we can observe the data type of each column, the total number of rows, and the number of non-null values per column. After pre-processing, there are no missing values in any of the 35 rows. Of the 11 columns, two have the object data type, one is int64, and the remaining eight are float64. This is a result of combining the data that share the same event context group.
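The facts that info() prints can also be checked programmatically. The following is a minimal sketch on a small made-up DataFrame; the column values are hypothetical and are not taken from the StudentEvent data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "StudentID": ["S1", "S2", "S3"],
    "Quiz": [0.5, -1.2, 0.7],
    "MarksBin": [1, 0, 1],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of columns of each data type
dtype_counts = toy.dtypes.astype(str).value_counts()

print(toy.shape)            # (rows, columns)
print(missing_per_column)
print(dtype_counts)
```

Applied to dataf1, the same two calls should confirm the row count, the absence of missing values, and the split of column data types described above.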

Observe The Data

Since the data has been standardized in RapidMiner, we use head() to view the dataset values.

dataf1.head(3)

Selected Data

After pre-processing, we identified 7 input variables and MarksBin as the output variable. The input variables are:

Assignment
Forum
Activity
LectureNote
Tutorial
Questionnaire
Quiz

data1 = dataf1[['Assignment','Forum','Activity','LectureNote','Tutorial','Questionnaire','Quiz','MarksBin']]

data1

Summarize Dataset

data1.describe()

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. Since our dataset has been standardized, the standard deviation of every input column is the same, namely 1.
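To illustrate why a standardized column reports a standard deviation of 1, here is a minimal sketch of z-score standardization in pandas. The raw values are hypothetical, and the sketch assumes the sample standard deviation (ddof=1), which is what describe() reports; it is not a reproduction of the RapidMiner step.

```python
import pandas as pd

# Hypothetical raw values, not taken from the StudentEvent data
raw = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0], name="Quiz")

# z-score: subtract the mean, then divide by the sample standard deviation
z = (raw - raw.mean()) / raw.std()

summary = z.describe()
print(summary[["mean", "std"]])
```

After this transformation the mean is 0 and the standard deviation is 1, up to floating-point rounding.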

Histogram

data1.hist(figsize=(16, 15), bins=50, xlabelsize=8, ylabelsize=8); # the trailing ; suppresses matplotlib's verbose text output

To create histograms, we use the pandas hist() method. Calling hist() on a pandas DataFrame draws one histogram for each numeric column, which lets us see the distribution of every numeric variable.

data1.plot.hist(bins=50, alpha=0.7)
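The bin counts that a histogram displays can also be inspected as numbers. The sketch below uses numpy.histogram on a few hypothetical standardized values with 4 bins instead of 50, purely to keep the output short.

```python
import numpy as np

# Hypothetical standardized values, not taken from the StudentEvent data
values = np.array([-1.5, -0.5, -0.4, 0.0, 0.3, 0.4, 1.2, 2.0])

# Split the range of the data into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.3f} to {right:.3f}: {n}")
```

Each bar in a histogram corresponds to one of these counts, and the bins argument controls how finely the value range is divided.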

Line Style Plot

data1.plot()

Scatter Plot

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. We use scatter plots to observe relationships between variables.

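import seaborn as sns

# Plot the input columns against MarksBin, up to 4 columns per row of plots
input_cols = data1.columns.drop('MarksBin')

for i in range(0, len(input_cols), 4):
    sns.pairplot(data=data1,
                 x_vars=input_cols[i:i+4],
                 y_vars=['MarksBin'])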

Heatmap

corr() computes the pairwise correlation of all columns in the DataFrame. Null values are automatically excluded. Non-numeric columns are ignored in older pandas versions; from pandas 2.0 onward they must be dropped first or excluded with numeric_only=True.

data1_corr = data1.corr()['MarksBin'][:-1] # [:-1] drops the last entry, which is MarksBin's correlation with itself

golden_features_list = data1_corr[abs(data1_corr) >= 0.1].sort_values(ascending=False)

print("There are {} variables correlated with MarksBin (|r| >= 0.1):\n{}".format(len(golden_features_list), golden_features_list))

For this dataset, 2 variables have an absolute correlation with MarksBin of at least 0.1.

Correlation is a statistic that measures the degree to which two variables move in relation to each other.
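As a worked illustration of this definition, the sketch below computes the Pearson correlation coefficient by hand and compares it with the result of pandas. The two series are hypothetical values chosen for easy arithmetic.

```python
import math
import pandas as pd

# Hypothetical values, not taken from the StudentEvent data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Deviations of each value from its series mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the sums of squared deviations
r_manual = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

Here the products of deviations sum to 8 and both sums of squared deviations equal 10, so r = 8 / 10 = 0.8, indicating that the two series tend to increase together.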

import matplotlib.pyplot as plt

corr = data1.drop('MarksBin', axis=1).corr() # We already examined MarksBin correlations

plt.figure(figsize=(12, 10))

# Show only correlations with an absolute value of at least 0.1
sns.heatmap(corr[(corr >= 0.1) | (corr <= -0.1)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);

Next Topic: Data Visualization


Copyright by 199607-Build using sites.google.com