Given a set of objects, the purpose of clustering is to discover groups such that objects within a group are similar to one another and objects in different groups are dissimilar
The number of clusters depends on which attribute or combination of attributes is used to differentiate the objects
Each object (e.g. each document) can be in only one cluster
The clusters are mutually exclusive
A data mining technique that is normally used for exploratory purposes, to discover patterns that appear in the natural grouping of objects
It is an unsupervised learning technique, in which no pre-defined labels are used in the mining process
Text Mining
Clustering divides the corpus of documents into mutually exclusive groups based on the presence of similar themes
Applications in Text Mining
Document Grouping
Grouping of documents in a corpus based on a similar theme or topic
Topic Grouping
Documents in a corpus may have similar sets of words about the same topic
Clustering – Similarity Metrics
Distance based method
Dissimilarity is conceptualized as the distance between objects
The most commonly used measure is the Euclidean distance between 2 points in Euclidean space
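As a minimal sketch of the distance-based idea, the Euclidean distance between two points can be computed as below. The function name and the sample term-count vectors are illustrative assumptions, not part of the original notes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-count vectors for two documents over a 3-word vocabulary.
doc_a = [3, 0, 4]
doc_b = [0, 0, 0]
print(euclidean_distance(doc_a, doc_b))  # 5.0
```

A smaller value means the two objects are less dissimilar; identical vectors give a distance of 0.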
Correlation based method
High correlations indicate similarity (correspondence of patterns across variables) between objects (documents)
Cosine similarity is commonly used
Correlation by Cosine:
Convert each word into a vector
Use cosine distance (derived from the angle between the 2 vectors) to find the distance between 2 word vectors
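The cosine approach can be sketched as follows. Cosine similarity is the cosine of the angle between two vectors, and cosine distance is taken here as 1 minus that similarity, which is one common convention. The function names and sample vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One common convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical word vectors: v2 points the same way as v1, v3 is orthogonal.
v1 = [1, 2, 0]
v2 = [2, 4, 0]
v3 = [0, 0, 5]
print(cosine_similarity(v1, v2))  # close to 1 (same direction)
print(cosine_similarity(v1, v3))  # 0.0 (no shared components)
```

Unlike Euclidean distance, this measure ignores vector length, so a long and a short document with the same word proportions are treated as similar.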
Association based method
Assesses the degree of agreement or matching between two objects (e.g. documents)
Based on a matching coefficient, which measures the percentage of attributes on which 2 objects match.
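A minimal sketch of a matching coefficient, assuming each object is described by a list of categorical (here binary) attributes and that a "match" means two objects hold the same value for an attribute. This corresponds to the simple matching coefficient; the notes do not say which variant is intended, and the sample data is hypothetical.

```python
def simple_matching_coefficient(a, b):
    """Fraction of attribute positions at which two objects hold the same value."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("objects must have the same, non-zero number of attributes")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical presence/absence of 4 keywords in two documents.
doc_a = [1, 0, 1, 1]
doc_b = [1, 1, 1, 0]
print(simple_matching_coefficient(doc_a, doc_b))  # 0.5
```

Multiplying the result by 100 gives the percentage of matching attributes.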
Types of Clustering
Hierarchical Clustering
Structure is like a tree (parent-child)
The two most similar clusters are combined, and combining continues until all objects are in the same cluster
Agglomerative (bottom-up)
Initially, each point is a cluster
Repeatedly combine the two “nearest” clusters into one
Divisive (top-down)
Start with one cluster and recursively split it
Typically uses Euclidean distance
Dendrogram (tree diagram of the merges; clusters are often chosen where the height between 2 levels is largest)
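The agglomerative procedure can be sketched in plain Python as below. This is an illustration under stated assumptions, not a definitive implementation: it uses Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), a choice the notes do not specify. The helper `suggest_cluster_count` is hypothetical and applies one common heuristic: cut the dendrogram at the largest gap between successive merge heights.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    """Distance between two clusters = distance between their closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]  # initially, each point is a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two "nearest" clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

def suggest_cluster_count(merges, n_points):
    """Heuristic: stop merging just before the largest jump in merge height."""
    heights = [h for _, _, h in merges]
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare heights")
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    idx = gaps.index(max(gaps))
    return n_points - (idx + 1)

# Hypothetical 2-D points: two tight pairs and one far-away point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = agglomerative(points)
for left, right, height in merges:
    print(left, right, round(height, 3))
print(suggest_cluster_count(merges, len(points)))  # 2
```

Each tuple in the merge history corresponds to one join in the dendrogram, with the height giving the distance at which the two clusters were combined.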
Partitioning Clustering
K-Means Clustering
A cluster is a collection of objects that are “similar” to one another and “dissimilar” to objects belonging to other clusters
A division of objects into clusters such that each object is in exactly one cluster.
Each cluster is associated with a centroid (centre point)
Each point is assigned to the cluster with the closest centroid
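The centroid and assignment steps above can be sketched as a small K-Means loop. This follows the standard iteration (assign each point to its closest centroid, then move each centroid to the mean of its points, repeating until assignments stop changing). The function names, the caller-supplied initial centroids, and the sample points are assumptions for illustration; practical implementations usually choose starting centroids randomly or with a seeding scheme.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def k_means(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: no point changed cluster
            break
        labels = new_labels
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups, with K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because every point receives exactly one label, the result is a partition: each object is in exactly one cluster.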