4.1 Text Clustering

Clustering

Given a set of objects, the purpose of clustering is to discover groups such that objects within a group are similar to one another and objects in different groups are dissimilar
The number of clusters depends on which attribute, or combination of attributes, is used to differentiate the objects
An object (each document) can be in only one cluster
The clusters are mutually exclusive
Clustering is a data mining technique normally used for exploratory purposes, to discover patterns that appear in the natural grouping of objects
It is an unsupervised learning technique: no pre-defined labels are used in the mining process
Text Mining
- Clustering divides the corpus of documents into mutually exclusive groups based on the presence of similar themes

Applications in Text Mining

Documents Grouping
- Grouping of documents in a corpus based on a similar theme or topic
Topics Grouping
- Documents in a corpus may have similar sets of words about the same topic

Clustering – Similarity Metrics

Distance based method
- Dissimilarity is conceptualized as the distance between objects
- The most commonly used measure is the Euclidean distance between two points in Euclidean space
Correlation based method
- High correlations indicate similarity (correspondence of patterns across variables) between objects (documents)
- The most commonly used measure is cosine similarity
- Correlation by cosine:
  - Convert each word into a vector
  - Use the cosine of the angle between two word vectors to measure how similar they are (cosine distance is 1 minus this similarity)
Association based method
- Assesses the degree of agreement or matching between two objects (e.g. documents)
- Based on a matching coefficient, which measures the percentage of times that two objects match
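
The three families of measures above can be illustrated with a minimal sketch in plain Python. The function names and the small term-count vectors are hypothetical, chosen only for illustration. Each document is assumed to be already represented as a list of term counts or of presence flags.

```python
import math

def euclidean_distance(a, b):
    """Distance-based: straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Correlation-based: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def simple_matching(a, b):
    """Association-based: fraction of positions on which two vectors agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical term-count vectors over a four-word vocabulary
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]   # same proportions as doc_a, but a longer document
doc_c = [0, 0, 3, 1]   # shares no words with doc_a

print(euclidean_distance(doc_a, doc_b))   # non-zero, because the lengths differ
print(cosine_similarity(doc_a, doc_b))    # close to 1.0, same direction
print(cosine_similarity(doc_a, doc_c))    # 0.0, no shared terms
print(simple_matching([1, 1, 0, 1], [1, 0, 0, 1]))  # 3 of 4 positions agree
```

Note that doc_a and doc_b are far apart by Euclidean distance but identical by cosine similarity, which is one reason cosine is often preferred when documents differ in length.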

Types of Clustering

Hierarchical Clustering
- Produces a tree-like (parent-child) structure
- The two most similar clusters are combined, and merging continues until all objects are in the same cluster
- Agglomerative (bottom-up)
  - Initially, each point is a cluster
  - Repeatedly combine the two “nearest” clusters into one
- Divisive (top-down)
  - Start with one cluster and recursively split it
  - Typically uses Euclidean distance
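- Dendrogram: a tree diagram of the successive merges; a common heuristic is to cut it where the height between two successive levels is at its maximum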
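
The agglomerative procedure can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. It assumes single linkage (the distance between two clusters is the distance between their closest members), and the helper `suggest_cut`, which applies the largest-gap reading of the dendrogram, is a hypothetical name introduced here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    """Distance between two clusters = distance between their closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, n_clusters=1):
    """Merge the two nearest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, merge_distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # each point starts as a cluster
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

def suggest_cut(merges, n_points):
    """Heuristic: cut where the gap between successive merge heights is largest.

    Expects the full merge history (n_clusters=1) with at least two merges.
    """
    heights = [d for _, _, d in merges]
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    biggest = max(range(len(gaps)), key=lambda i: gaps[i])
    return n_points - (biggest + 1)  # clusters left just before the big jump

# Hypothetical 2-D points: two tight pairs and one isolated point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, _ = agglomerative(points, n_clusters=3)
_, history = agglomerative(points, n_clusters=1)
print(clusters)
print(suggest_cut(history, len(points)))
```

The nested search over all cluster pairs is quadratic per merge, which is acceptable for a small illustration but slow for a large corpus.
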
Partitioning Clustering
- K-Means Clustering
  - A cluster is a collection of objects that are “similar” to one another and “dissimilar” to objects belonging to other clusters
  - A division of objects into clusters such that each object is in exactly one cluster.
  - Each cluster is associated with a centroid (centre point)
  - Each point is assigned to the cluster with the closest centroid
  - Number of clusters, K, must be specified
- Typically uses Euclidean distance
- Popular for numeric data clustering
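
The K-means steps listed above (assign each point to the closest centroid, then recompute each centroid) can be sketched in plain Python. This is a minimal illustration under one stated assumption: the first K points are used as the initial centroids so that the result is deterministic, whereas practical implementations usually pick starting centroids at random or with a smarter seeding scheme.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(points, k, max_iter=100):
    # Assumption for this sketch: the first k points are the initial centroids
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two visible groups; K must be given up front
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Each point receives exactly one label, which matches the hard, mutually exclusive assignment described above.
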
Probabilistic Clustering (e.g. topic modelling)
- Expectation Maximization
  - Estimates probabilities of each object belonging to each cluster
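  - The goal is to maximize the overall probability (likelihood) of the objects under the cluster model.
  - Comprises two steps:
    1. Expectation (E)
      - Similar to the assignment step of K-means clustering
      - Instead of a single hard assignment, each object is given a weight or probability for each cluster
    2. Maximization (M)
      - Each cluster's parameters are re-estimated as weighted averages, using the weights from the E step
      - The two steps are repeated until the weighted averages do not change significantly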
- Typically uses Mahalanobis distance
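
The E and M steps can be sketched for the simplest case, a mixture of one-dimensional Gaussians, in plain Python. This is an illustrative sketch under stated assumptions, not the method of any particular tool: the function name `em_gmm_1d`, the starting values, and the data are all hypothetical. In one dimension the Mahalanobis distance reduces to |x - mean| / standard deviation, which is the quantity inside the Gaussian density's exponent.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50, tol=1e-6):
    """EM for a 1-D Gaussian mixture, starting from the given initial guesses."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E step: probability of each object belonging to each cluster
        resp = []
        for x in data:
            scores = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                      for c in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M step: re-estimate each cluster as probability-weighted averages
        new_means = []
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            mu = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - mu) ** 2 for r, x in zip(resp, data)) / n_c
            new_means.append(mu)
            variances[c] = max(var, 1e-6)  # floor avoids a zero variance
            weights[c] = n_c / len(data)
        # Stop when the weighted averages no longer change significantly
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights, resp

# Hypothetical 1-D data with two groups, and rough initial guesses
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means)
print(resp[0])  # membership probabilities of the first object
```

Unlike K-means, each object keeps a probability for every cluster; a hard assignment can still be obtained by taking the most probable cluster for each object.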

SVD - Expectation Maximization (SVD Resolution: Low)

SVD - Hierarchical (SVD Resolution: Low)
