The key purpose of data preparation is to transform unstructured text into structured numbers so that standard data mining techniques can be applied for analysis
The numerical representation of the text takes the form of a spreadsheet-like structure called a Term-by-Document Matrix (TBDM)
In a TBDM, the dimensions (rows and columns) are determined by the number of terms and the number of documents in the corpus (a small construction sketch follows the list below)
Observations about entries in the TBDM:
The size of a TBDM (number of terms × number of documents) grows rapidly as more documents are added and more terms are extracted
Entries are non-negative numbers
High dimensionality (i.e., many rows and columns)
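As a minimal sketch of how a TBDM can be built, the Python snippet below uses scikit-learn's CountVectorizer on three invented documents; CountVectorizer actually produces a document-by-term matrix, so it is transposed here:

```python
# Build a small term-by-document matrix from three invented documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the camera has great image quality",
    "image color looks great on this camera",
    "battery life is poor",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)         # documents x terms (sparse)
tbdm = dtm.T.toarray()                       # transpose -> terms x documents

print(vectorizer.get_feature_names_out())    # the extracted terms (one row each)
print(tbdm)                                  # non-negative counts, one column per document
```

Even this tiny corpus yields a matrix with fourteen rows, illustrating how quickly the dimensionality grows.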
A good approach to identifying significant terms in a corpus is to assign a weight (weighted frequency) to each entry in the TBDM
The weights are normally derived from the term frequency, the document frequency, and the size of the corpus
Term Frequency is the number of times the term appears in a document
Document Frequency is the number of documents in which the term appears
Size of corpus is the number of documents in the corpus
The Weighted Frequency of a term in the TBDM is computed as:
Weighted Frequency (Term weight) = Local weight × Global weight
Local weight
Local Weight (frequency weight) is the transformed weight of the term within a document
It is derived by applying a function to the raw frequency of the term in the document
Terms that occur very often in a document are assigned a higher weight and are considered important because they best describe the document
Three options (sketched below): Log, Binary, None
Log -> log(frequency + 1)
Binary -> 1 if the term is present in the document, otherwise 0
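A short sketch of the three local-weight options applied to an invented vector of raw frequencies; the base of the logarithm is an assumption here (base 2 is common):

```python
import numpy as np

# Raw frequencies of four terms in one document (invented numbers).
freq = np.array([0, 1, 3, 10])

log_w = np.log2(freq + 1)            # Log: dampens very frequent terms
binary_w = (freq > 0).astype(int)    # Binary: 1 if present, else 0
none_w = freq                        # None: raw frequency, unchanged

print(log_w)      # [0.    1.    2.    3.459]
print(binary_w)   # [0 1 1 1]
```

Note how the log transform compresses the gap between a term appearing 3 times and 10 times, reflecting the idea that the tenth occurrence adds little new information.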
Global weight
Global weight (term weight) is assigned to a term based on its overall frequency and document frequency within the corpus
In SAS Text Miner, three common techniques are used to determine the term weight of a term (word):
Entropy
Entropy is a concept related to information gain
In text mining, entropy indicates how informative a term is in the corpus
Terms that have high global weights are those that appear with high frequency within a document but in only a few documents
Terms that have low global weights are those that appear with high frequency within a document and also in many documents
Typically suited to small documents
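Below is a sketch of one common formulation of the entropy global weight (the exact formula SAS Text Miner uses may differ in detail): G_i = 1 + Σ_j (p_ij log2 p_ij) / log2(n), where p_ij is the term's frequency in document j divided by its total frequency in the corpus:

```python
import numpy as np

# Term-by-document frequencies (rows = terms, columns = documents, invented).
tbdm = np.array([
    [5.0, 0.0, 0.0, 0.0],   # concentrated in one document
    [2.0, 2.0, 2.0, 2.0],   # spread evenly across all documents
])

n_docs = tbdm.shape[1]
p = tbdm / tbdm.sum(axis=1, keepdims=True)   # p_ij = f_ij / global frequency

# p * log2(p), with the 0 * log(0) cases defined as 0
plogp = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
entropy_weight = 1 + plogp.sum(axis=1) / np.log2(n_docs)

print(entropy_weight)   # [1. 0.]: concentrated term scores high, evenly spread term scores 0
```

This matches the observations above: a term that is frequent but confined to few documents gets a high weight, while a term spread across many documents gets a low one.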
Inverse Document Frequency (IDF)
IDF calculates importance as the inverse of the frequency of occurrence of a term in the corpus of documents
A term that appears infrequently is considered more important and is given a higher score (weight), whereas a term with a high frequency of appearance is considered less important and is given a lower score
An IDF weight value of zero indicates that the word appears in all documents and has an insignificant effect on discriminating the documents
Typically suited to huge documents
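A sketch of the plain IDF calculation on an invented matrix; note how the term that appears in every document gets a weight of exactly zero, as described above. Combining a local weight with this global weight per the earlier formula (local × global) yields the familiar TF-IDF weighted frequency:

```python
import numpy as np

tbdm = np.array([
    [3, 0, 0, 0],   # rare term: appears in 1 of 4 documents
    [1, 2, 1, 4],   # common term: appears in all 4 documents
])

n_docs = tbdm.shape[1]
df = (tbdm > 0).sum(axis=1)      # document frequency of each term

idf = np.log2(n_docs / df)       # 0 when the term appears in every document
print(idf)                       # [2. 0.]: the rare term gets the higher weight
```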
Mutual Information
Mutual information can be derived as the difference between the entropy of variable X and the conditional entropy of X given Y: MI(X; Y) = H(X) - H(X | Y)
In text mining, this can be thought of as a measure of how much the presence of one term in a document tells us about whether the document belongs to a particular category
Used when there is a target variable, e.g., for classification
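A sketch of mutual information between a term's presence and a binary target category, computed directly from the entropy-difference definition above (the per-document data is invented):

```python
import numpy as np

# For each of 8 documents: does the term appear, and what is the document's class?
term_present = np.array([1, 1, 1, 0, 0, 0, 1, 0])
category     = np.array([1, 1, 1, 0, 0, 0, 0, 1])

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# MI = H(category) - H(category | term presence)
h_cond = sum(
    (term_present == v).mean() * entropy(category[term_present == v])
    for v in np.unique(term_present)
)
mi = entropy(category) - h_cond
print(mi)   # ~0.19 bits: knowing the term narrows down the category somewhat
```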
Filter Node
Interactive Filter Viewer
A feature of the Text Filter node used to refine the results of parsing and filtering
Using the Interactive Filter Viewer, the user can browse all of the parsed terms and manually modify the list by dropping or keeping terms
Concept Links
Concept links help in understanding the relationships between words based on their co-occurrence in the documents
The first number represents the number of documents in which the two terms co-occur, and the second number represents the total number of documents in which the specific term occurs
From the concept links, it can be seen that "image quality" has a strong relationship with "image"
It can also be seen that "image" has a strong relationship with "color"; one could infer that color enhances the image, and hence that "image" is highly associated with "image quality"
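A sketch of how the two numbers on a concept link can be derived by counting co-occurrences across documents (the documents, already reduced to sets of parsed terms, are invented):

```python
from collections import Counter
from itertools import combinations

docs = [
    {"image", "quality", "color"},
    {"image", "quality"},
    {"image", "color", "camera"},
    {"battery", "camera"},
]

term_df = Counter()   # number of documents containing each term
pair_df = Counter()   # number of documents in which each pair of terms co-occurs
for terms in docs:
    term_df.update(terms)
    pair_df.update(combinations(sorted(terms), 2))

# For the link between "image" and "quality":
print(pair_df[("image", "quality")], term_df["image"])   # 2 co-occurrences, "image" in 3 documents
```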
Singular Value Decomposition (SVD)
The use of Singular Value Decomposition (SVD) is known as the linear algebra approach to Text Mining, Information Retrieval, Web Analytics, etc.
This algebraic operation is the foundation of an approach that goes by many names, such as:
Latent Semantic Indexing (LSI)
Latent Semantic Analysis (LSA)
Vector Space Model (VSM)
All these approaches are NLP (Natural Language Processing) techniques that use SVD mathematics to determine or identify relationships between terms, concepts, or topics
The underlying assumption is that terms close in meaning will frequently occur in the same sets of documents
SVD is used to measure correlation between objects
SVD values are used as inputs to predictive models in text classification
SVD Example:
Dimension Reduction using SVD
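As a sketch of the idea, the truncated SVD below projects each document of a tiny invented TBDM onto k = 2 dimensions; the two blocks of documents separate cleanly along the two retained singular vectors:

```python
import numpy as np

# A small term-by-document matrix (rows = terms, columns = documents, invented).
tbdm = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 1.0],
])

U, s, Vt = np.linalg.svd(tbdm, full_matrices=False)

k = 2                                        # number of dimensions to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a k-dimensional vector

print(doc_vectors)   # documents 1-2 load on one dimension, documents 3-4 on the other
```

These k-dimensional document vectors are the SVD values that feed clustering and predictive models.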
SVD Resolution value:
High = 100%
High resolution means keeping more SVD dimensions, i.e., more granular
Medium = 5/6 = 83.3%
Low = 2/3 = 66.67% (the default)
Low resolution means retaining fewer SVD dimensions, i.e., more summarized
The maximum number of SVD dimensions can be any value between 2 and 500; a higher number generates a better data summary but takes more computing power
Value of ‘K’ in SVD computation
SVD computation is memory intensive
For large datasets, the SVD computation may use random sampling instead of the full dataset to avoid running out of memory
Value of K (Max SVD Dimensions)
A high value of K gives a better approximation of the actual matrix
Too high a value of K results in a large number of dimensions, which is computationally intensive for modelling and may introduce noise
Generally, values of 20 to 200 are appropriate as the first K singular values to be calculated
K values of 2 to 50 are appropriate for clustering
K values of 30 to 200 are appropriate for prediction or classification
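One practical way to pick K is to look at how much of the matrix's variation the first K singular values capture; the sketch below (on invented random data) finds the smallest K reaching 80%:

```python
import numpy as np

# Invented sparse-ish term-by-document matrix; real TBDMs are far larger.
rng = np.random.default_rng(0)
tbdm = rng.poisson(0.3, size=(200, 50)).astype(float)

s = np.linalg.svd(tbdm, compute_uv=False)        # singular values only
explained = np.cumsum(s**2) / np.sum(s**2)       # cumulative proportion captured

k = int(np.searchsorted(explained, 0.80)) + 1    # smallest K capturing >= 80%
print(k, round(float(explained[k - 1]), 3))
```

The 80% threshold is an illustrative choice, not a SAS default; the 2 to 50 and 30 to 200 guidelines above remain the usual starting points.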