In this activity, we used the Data-Binning dataset, which is the original dataset pivoted by Student ID and event context but not yet grouped by the same event context. This dataset contains 35 rows and 27 columns.
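The listing below uses pandas, seaborn, and matplotlib without showing the imports; a minimal setup sketch (the drive.mount() call is an assumption about how the Google Drive path was made available in Colab) would be:
import pandas as pd                 # DataFrame loading and manipulation
import seaborn as sns               # pairplot and heatmap later in this activity
import matplotlib.pyplot as plt     # figure sizing for the heatmap

# assumption: when running in Google Colab, the Drive folder must be mounted first
from google.colab import drive
drive.mount('/content/drive')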
pathd = "/content/drive/My Drive/Colab Notebooks/Data-Binning.xlsx"
dfb = pd.read_excel(pathd)
dfb.head(3)
The pd.read_excel() function reads the Excel file into a pandas DataFrame; dfb.head(3) previews the first three rows.
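If the workbook contained more than one sheet, pd.read_excel() also accepts a sheet_name argument; a hypothetical sketch (the variable dfb_other and the sheet name 'Sheet1' are assumptions, not part of the original dataset):
dfb_other = pd.read_excel(pathd, sheet_name='Sheet1')   # hypothetical: load a specific sheet by name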
dfb.info()
By calling dfb.info() we can observe each column's datatype, the total number of rows, and the number of non-null values per column. For this dataset there are no missing values, and out of 27 columns only two have the object datatype; the remaining three columns are int64 and twenty-two are float64.
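The datatype breakdown can also be confirmed directly; a one-line sketch (not in the original listing):
print(dfb.dtypes.value_counts())   # expected: 22 float64, 3 int64, 2 object columns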
First, we rename the columns so they are easier to work with in pandas.
dfb.rename(columns={'Student ID': 'StudentID',
'Assignment: Assignment 2':'Assignment',
'Assignment: Capstone Project Submission':'ProjectSubmission',
'File: Guide for Data Preprocessing':'AccessFile',
'Forum: Activity - Build Your First Dashboard!':'Forum1',
'Forum: Activity - Reflection of Experience':'Forum2',
'Forum: Activity - Share your first RapidMiner project experience!':'Forum3',
'Forum: Activity 1- Data Analytic in Action':'Forum4',
'Interactive Content: Activity - Find the Pairs!':'Activity1',
'Interactive Content: Activity - Supervised and Unsupervised Machine Learning Video with Interactive Quiz':'Activity2',
'Interactive Content: Activity-Pair the Machine Learning Algo with Suitable Concepts':'Activity3',
'Interactive Content: Activity-Terminology':'Activity4',
'Page: Activity - Your Data Analytics Apps Experience':'Page1',
'Page: Activity -Test Your Understanding - Quizziz':'Page2',
'Page: Lecture Notes - Data Preprocessing':'Page3',
'Page: Lecture Notes - Exploratory Data Analysis':'Page4',
'Page: Lecture Notes - Introduction to Data Analytics':'Page5',
'Page: Lecture Notes - Machine Learning':'Page6',
'Page: Tutorial - Using RapidMiner for Data Analytics':'Page7',
'Page: Tutorial on Exploratory Data Analysis using Power BI':'Page8',
'Questionnaire: Self Assessment':'Questionnaire',
'Quiz: Quiz 1':'Quiz1',
'Quiz: Quiz 2':'Quiz2',
'URL: Activity-Self assessment on Data Analytics Processes':'URL',
'Binning Label':'MarksBin'},
inplace=True)
print(dfb.columns)
dfb['StudentID'] = dfb['StudentID'].str[1:]         # drop the first character of the ID string
dfb['StudentID'] = pd.to_numeric(dfb['StudentID'])  # convert the remaining digits to a numeric dtype
dfb.info()
dfb1 = dfb[['StudentID', 'Assignment', 'ProjectSubmission',
'AccessFile', 'Forum1', 'Forum2', 'Forum3', 'Forum4', 'Activity1',
'Activity2', 'Activity3', 'Activity4', 'Page1', 'Page2', 'Page3',
'Page4', 'Page5', 'Page6', 'Page7', 'Page8', 'Questionnaire', 'Quiz1',
'Quiz2', 'URL', 'MarksBin']]
dfb1.info()
dfb1.describe()
The describe() function is used to generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
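Since dfb1 has 25 columns, the summary is easier to scan when transposed; a small readability sketch, not part of the original analysis:
dfb1.describe().T   # one row per variable: count, mean, std, min, quartiles, max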
dfb1.hist(figsize=(16, 15), bins=50, xlabelsize=8, ylabelsize=8);  # the trailing ; suppresses matplotlib's verbose text output
To create histograms, we use the pandas hist() method. Calling hist() on a pandas DataFrame returns a histogram for every non-nuisance (numeric) series in the DataFrame, which lets us see the distribution of each variable.
A scatter plot (also called a scatter chart or scatter graph) uses dots to represent the values of two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values of an individual data point. We use scatter plots to observe relationships between variables; here, each feature is plotted against MarksBin with seaborn's pairplot, in batches of five columns.
for i in range(0, len(dfb1.columns), 5):   # step of 5 matches the 5-column slice so every feature is plotted against MarksBin
    sns.pairplot(data=dfb1,
                 x_vars=dfb1.columns[i:i+5],
                 y_vars=['MarksBin'])
corr() is used to find the pairwise correlation of all columns in the DataFrame. Any null values are automatically excluded, and non-numeric columns are ignored.
dfb1_corr = dfb1.corr()['MarksBin'][:-1]  # drop the last entry, which is MarksBin's correlation with itself
golden_features_list = dfb1_corr[abs(dfb1_corr) > 0.2].sort_values(ascending=False)
print("There is {} strongly correlated values with MarksBin:\n{}".format(len(golden_features_list), golden_features_list))
For this dataset, there are 5 variables whose absolute correlation with MarksBin is greater than 0.2.
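To see these filtered correlations at a glance, they could also be drawn as a horizontal bar chart; a sketch reusing the golden_features_list Series defined above (not part of the original notebook):
golden_features_list.plot(kind='barh', figsize=(8, 4), title='Correlation with MarksBin');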
corr = dfb1.drop('MarksBin', axis=1).corr()  # MarksBin correlations were already examined above
plt.figure(figsize=(12, 10))
sns.heatmap(corr[(corr >= 0.2) | (corr <= -0.1)],  # keep only cells with correlation >= 0.2 or <= -0.1
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
Correlation is a statistic that measures the degree to which two variables move in relation to each other.
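As a minimal, self-contained illustration of what corr() computes by default (Pearson's r, the covariance of two series divided by the product of their standard deviations), the toy series below are made up purely for demonstration:
# two toy series that move together, so Pearson's r is strongly positive
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 6])
print(x.corr(y))    # about 0.85, indicating a strong positive linear relationship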