In this activity, we used the Data-Binning dataset, which is the original dataset pivoted by Student ID and event context but not yet grouped by the same event context. This dataset contains 35 rows and 27 columns.
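The listing below uses pandas, seaborn, and matplotlib without showing the imports; a minimal setup sketch (the drive.mount() call is an assumption about how the Google Drive path was made available in Colab) would be:
import pandas as pd                 # DataFrame loading and manipulation
import seaborn as sns               # pairplot and heatmap later in this activity
import matplotlib.pyplot as plt     # figure sizing for the heatmap

# assumption: when running in Google Colab, the Drive folder must be mounted first
from google.colab import drive
drive.mount('/content/drive')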
pathd = "/content/drive/My Drive/Colab Notebooks/Data-Binning.xlsx"
dfb = pd.read_excel(pathd)
dfb.head(3)
The pd.read_excel() function reads the Excel file into a pandas DataFrame; dfb.head(3) previews the first three rows.
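If the workbook contained more than one sheet, pd.read_excel() also accepts a sheet_name argument; a hypothetical sketch (the variable dfb_other and the sheet name 'Sheet1' are assumptions, not part of the original dataset):
dfb_other = pd.read_excel(pathd, sheet_name='Sheet1')   # hypothetical: load a specific sheet by name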
dfb.info()
By calling dfb.info() we can observe each column's datatype, the total number of rows, and the number of non-null values per column. For this dataset there are no missing values, and out of 27 columns only two have the object datatype; the remaining three columns are int64 and twenty-two are float64.
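The datatype breakdown can also be confirmed directly; a one-line sketch (not in the original listing):
print(dfb.dtypes.value_counts())   # expected: 22 float64, 3 int64, 2 object columns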
First, we rename the columns so they are easier to work with in pandas.
dfb.rename(columns={'Student ID': 'StudentID',
'Assignment: Assignment 2':'Assignment',
'Assignment: Capstone Project Submission':'ProjectSubmission',
'File: Guide for Data Preprocessing':'AccessFile',
'Forum: Activity - Build Your First Dashboard!':'Forum1',
'Forum: Activity - Reflection of Experience':'Forum2',
'Forum: Activity - Share your first RapidMiner project experience!':'Forum3',
'Forum: Activity 1- Data Analytic in Action':'Forum4',
'Interactive Content: Activity - Find the Pairs!':'Activity1',
'Interactive Content: Activity - Supervised and Unsupervised Machine Learning Video with Interactive Quiz':'Activity2',
'Interactive Content: Activity-Pair the Machine Learning Algo with Suitable Concepts':'Activity3',
'Interactive Content: Activity-Terminology':'Activity4',
'Page: Activity - Your Data Analytics Apps Experience':'Page1',
'Page: Activity -Test Your Understanding - Quizziz':'Page2',
'Page: Lecture Notes - Data Preprocessing':'Page3',
'Page: Lecture Notes - Exploratory Data Analysis':'Page4',
'Page: Lecture Notes - Introduction to Data Analytics':'Page5',
'Page: Lecture Notes - Machine Learning':'Page6',
'Page: Tutorial - Using RapidMiner for Data Analytics':'Page7',
'Page: Tutorial on Exploratory Data Analysis using Power BI':'Page8',
'Questionnaire: Self Assessment':'Questionnaire',
'Quiz: Quiz 1':'Quiz1',
'Quiz: Quiz 2':'Quiz2',
'URL: Activity-Self assessment on Data Analytics Processes':'URL',
'Binning Label':'MarksBin'},
inplace=True)
print(dfb.columns)
dfb['StudentID'] = dfb['StudentID'].str[1:]         # drop the first character of the ID string
dfb['StudentID'] = pd.to_numeric(dfb['StudentID'])  # convert the remaining digits to a numeric dtype
dfb.info()
dfb1 = dfb[['StudentID', 'Assignment', 'ProjectSubmission',
'AccessFile', 'Forum1', 'Forum2', 'Forum3', 'Forum4', 'Activity1',
'Activity2', 'Activity3', 'Activity4', 'Page1', 'Page2', 'Page3',
'Page4', 'Page5', 'Page6', 'Page7', 'Page8', 'Questionnaire', 'Quiz1',
'Quiz2', 'URL', 'MarksBin']]
dfb1.info()
dfb1.describe()
The describe() function is used to generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
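Since dfb1 has 25 columns, the summary is easier to scan when transposed; a small readability sketch, not part of the original analysis:
dfb1.describe().T   # one row per variable: count, mean, std, min, quartiles, max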
dfb1.hist(figsize=(16, 15), bins=50, xlabelsize=8, ylabelsize=8);  # the trailing ; suppresses matplotlib's verbose text output
To create histograms, we use the pandas hist() method. Calling hist() on a pandas DataFrame returns a histogram for every non-nuisance (numeric) series in the DataFrame, which lets us see the distribution of each variable.
A scatter plot (also called a scatter chart or scatter graph) uses dots to represent the values of two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values of an individual data point. We use scatter plots to observe relationships between variables; here, each feature is plotted against MarksBin with seaborn's pairplot, in batches of five columns.
for i in range(0, len(dfb1.columns), 5):   # step of 5 matches the 5-column slice so every feature is plotted against MarksBin
    sns.pairplot(data=dfb1,
                 x_vars=dfb1.columns[i:i+5],
                 y_vars=['MarksBin'])
corr() is used to find the pairwise correlation of all columns in the DataFrame. Any null values are automatically excluded, and non-numeric columns are ignored.
dfb1_corr = dfb1.corr()['MarksBin'][:-1]  # drop the last entry, which is MarksBin's correlation with itself
golden_features_list = dfb1_corr[abs(dfb1_corr) > 0.2].sort_values(ascending=False)
print("There is {} strongly correlated values with MarksBin:\n{}".format(len(golden_features_list), golden_features_list))
For this dataset, there are 5 variables whose absolute correlation with MarksBin is greater than 0.2.
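To see these filtered correlations at a glance, they could also be drawn as a horizontal bar chart; a sketch reusing the golden_features_list Series defined above (not part of the original notebook):
golden_features_list.plot(kind='barh', figsize=(8, 4), title='Correlation with MarksBin');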
corr = dfb1.drop('MarksBin', axis=1).corr()  # MarksBin correlations were already examined above
plt.figure(figsize=(12, 10))
sns.heatmap(corr[(corr >= 0.2) | (corr <= -0.1)],  # keep only cells with correlation >= 0.2 or <= -0.1
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
Correlation is a statistic that measures the degree to which two variables move in relation to each other.
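As a minimal, self-contained illustration of what corr() computes by default (Pearson's r, the covariance of two series divided by the product of their standard deviations), the toy series below are made up purely for demonstration:
# two toy series that move together, so Pearson's r is strongly positive
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 6])
print(x.corr(y))    # about 0.85, indicating a strong positive linear relationship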