Descriptive analytics uses statistical methods to search and summarize historical data in order to identify patterns or meaning, answering the question of what happened in the past. It helps a business understand how it is performing by giving stakeholders the context they need to interpret information, often in the form of data visualizations such as graphs, charts, reports, and dashboards. In this project, we apply K-Means clustering and agglomerative hierarchical clustering using RapidMiner and Python.
In this activity, we used the StudentEvent dataset. The original dataset was pivoted by Student ID and event context, and then grouped by that same event context.
Clustering algorithms, including K-Means, use distance-based measures to determine the similarity between data points. Because the features in a dataset almost always have different units of measurement, it is recommended to standardize the data so that each feature has a mean of zero and a standard deviation of one. For that reason, we standardized our dataset in RapidMiner and used the standardized data for the analysis in both RapidMiner and Python.
Dataset before standardization
Dataset after standardization
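To make the standardization step concrete, the following is a minimal sketch of a z-score transformation in plain Python. It is not the RapidMiner operator used in the project. The function name `standardize`, the sample values, and the choice of the population standard deviation (dividing by n rather than n - 1) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation (an assumption; some tools divide by n - 1).
    # Fall back to 1.0 for a constant column so we never divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical per-student event counts (not taken from the StudentEvent data).
raw = [
    [12.0, 300.0],
    [15.0, 100.0],
    [9.0, 200.0],
]
scaled = standardize(raw)
for row in scaled:
    print([round(value, 3) for value in row])
```

After the transformation, both columns are on the same scale, so neither dominates a Euclidean distance simply because its raw numbers are larger.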
K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It aims to partition n observations into k clusters, where k is the number of groups to find in the data. The algorithm works iteratively to assign each data point to one of the k groups based on the features that are provided, so that data points are clustered by feature similarity.
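The iterative assignment described above can be sketched in plain Python as follows. This is an illustrative version of the standard Lloyd's algorithm, not the RapidMiner operator or Python library call used in the project. The function name `kmeans`, its parameters, the random initialization from the data points, and the sample points are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    [sum(dim) / len(members) for dim in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical, already standardized 2-D points forming two obvious groups.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy input the first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initialization.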
In the K-Means algorithm, the most important decision is choosing the optimal value of k. Below are the steps that we applied in our project.
Agglomerative clustering is a type of hierarchical clustering, also called hierarchical cluster analysis (HCA), which is a method of cluster analysis that seeks to build a hierarchy of clusters. It is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. In general, the merges (and, in the top-down variant, the splits) are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
Initially, each data point is considered an individual cluster. At each iteration, the most similar clusters are merged, until either one cluster or K clusters remain. The main advantage is that we don’t need to specify the number of clusters in advance.
The hierarchical clustering technique can be visualized using a dendrogram, a tree-like diagram that records the sequence of merges or splits.
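The bottom-up merging and the merge sequence that a dendrogram records can be sketched in plain Python as follows. The source does not state which linkage criterion the project used, so single linkage (closest pair of points) is an assumption here. The function names, the `n_clusters` parameter, and the sample points are also hypothetical.

```python
import math


def single_linkage(cluster_a, cluster_b, points):
    """Cluster distance = distance between their closest pair of points."""
    return min(
        math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b
    )


def agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (members of cluster A, members of cluster B, merge distance)
    while len(clusters) > n_clusters:
        # Greedy step: find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
        clusters[a] = merged
    return clusters, merges


# Hypothetical 2-D points; indices 0..4 identify them in the output.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5], [10.0, 0.0]]
clusters, merges = agglomerative(points, n_clusters=2)
for left, right, distance in merges:
    print(f"merge {left} + {right} at distance {distance:.2f}")
print(clusters)
```

The `merges` list is the information a dendrogram draws: which clusters were joined and at what distance. Stopping at `n_clusters=2` corresponds to cutting the tree at the level where two clusters remain, while the default of 1 builds the full hierarchy.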