Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset. It can be defined as "a way of grouping the data points into different clusters, such that each cluster consists of similar data points." Objects with possible similarities remain in one group, and that group has few or no similarities with any other group.
Plot a heat map of the correlations between the columns summarized above by the wine_df.describe() command. The heat map shows that the quality column carries the highest correlation values, so we drop it in the next command.
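The heat-map step can be sketched as below. The small DataFrame here is a hypothetical stand-in for the tutorial's wine_df (its columns follow the common wine-quality dataset), and seaborn is an assumed plotting choice:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the tutorial's wine_df, loaded earlier.
wine_df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2],
    "alcohol": [9.4, 9.8, 9.8, 9.8],
    "quality": [5, 5, 5, 6],
})

# Pairwise Pearson correlations between the numeric columns.
corr = wine_df.corr()

# Annotated heat map of the correlation matrix.
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation between wine features")
plt.show()
```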
Drop the quality column, since clustering is applied to discover the quality of the wines without using labels. The dataset fed to the clustering algorithm should therefore not contain the "quality" column.
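Dropping the column can be sketched as follows (the stand-in DataFrame and its column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's wine_df.
wine_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 5, 6],
})

# Remove the target column so the clustering input is unlabeled.
X = wine_df.drop(columns=["quality"])
print(X.columns.tolist())  # ['alcohol', 'pH']
```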
K-Means clustering is an unsupervised learning algorithm that groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
Here we fit K-Means with n_clusters = i over a range of values of i, record the model's inertia_ on wine_df for each fit, and use the elbow method to choose the number of clusters.
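A minimal sketch of the elbow loop, using synthetic data in place of the quality-free wine features (the range of K values and the random seeds are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic two-blob data standing in for the wine features.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

inertias = []
for i in range(1, 7):
    km = KMeans(n_clusters=i, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia shrinks as K grows; the "elbow" is where the decrease
# levels off, suggesting a reasonable K.
print(inertias)
```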
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges over [-1, 1], where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
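The score described above can be computed with scikit-learn's silhouette_score; a small sketch on synthetic data (standing in for the wine features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated blobs, so the silhouette score should be high.
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(6, 0.5, (40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 3))  # close to 1 for well-separated clusters
```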
Dimensionality reduction refers to techniques for reducing the number of input variables in training data. When dealing with high-dimensional data, it is often useful to reduce the dimensionality by projecting the data onto a lower-dimensional subspace that captures the "essence" of the data.
We use the PCA technique to reduce the dimensionality of the dataset.
X = pca.fit_transform(wine_df)
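Expanded into a runnable sketch of that line; the scaling step, the choice of two components, and the synthetic data are assumptions added for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic 11-feature data standing in for the quality-free wine_df.
wine_df = rng.normal(size=(100, 11))

# Scaling first is common practice, since PCA is variance-based.
scaled = StandardScaler().fit_transform(wine_df)

# Project onto the first two principal components.
pca = PCA(n_components=2)
X = pca.fit_transform(scaled)
print(X.shape)  # (100, 2)
```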
We train our model with K-Means, using the silhouette analysis to select the number of clusters.
Now we divide the data into clusters.
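The final step can be sketched by fitting K-Means on the PCA-reduced data and attaching the cluster labels; K=2 here is illustrative, standing in for the value chosen by the elbow and silhouette analysis, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic features standing in for the quality-free wine_df.
features = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])
wine_df = pd.DataFrame(features)

# Reduce to 2 dimensions, then cluster.
X = PCA(n_components=2).fit_transform(wine_df)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Attach cluster assignments back to the DataFrame.
wine_df["cluster"] = labels
print(wine_df["cluster"].value_counts().to_dict())
```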