The key purpose of data preparation is to transform unstructured text into structured numbers so that standard data mining techniques can be applied for analysis
The numerical representation of the text takes the form of a spreadsheet-like structure called a Term-by-Document Matrix (TBDM)
In a TBDM, the dimensions (rows and columns) are determined by the number of terms and the number of documents in the corpus (a small construction sketch follows the list below)
Observations about entries in the TBDM:
The size of a TBDM (number of terms × number of documents) grows rapidly as more documents are added and more terms are extracted
Entries are non-negative numbers
High dimensionality (i.e., many rows and columns)
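As a minimal sketch of how a TBDM can be built, the Python snippet below uses scikit-learn's CountVectorizer on three invented documents; CountVectorizer actually produces a document-by-term matrix, so it is transposed here:

```python
# Build a small term-by-document matrix from three invented documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the camera has great image quality",
    "image color looks great on this camera",
    "battery life is poor",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)         # documents x terms (sparse)
tbdm = dtm.T.toarray()                       # transpose -> terms x documents

print(vectorizer.get_feature_names_out())    # the extracted terms (one row each)
print(tbdm)                                  # non-negative counts, one column per document
```

Even this tiny corpus yields a matrix with fourteen rows, illustrating how quickly the dimensionality grows.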
A good approach to identifying significant terms in a corpus is to assign a weight (weighted frequency) to each entry in the TBDM
The weights are normally derived from the term frequency, the document frequency, and the size of the corpus
Term Frequency is the number of times the term appears in a document
Document Frequency is the number of documents in which the term appears
Size of corpus is the number of documents in the corpus
The Weighted Frequency of a term in the TBDM is computed as:
Weighted Frequency (Term weight) = Local weight × Global weight
Local weight
Local Weight (frequency weight) is the transformed weight of the term within a document
It is derived by applying a function to the raw frequency of the term in the document
Terms that occur very often in a document are assigned a higher weight and are considered important because they best describe the document
Three options (sketched below): Log, Binary, None
Log -> log(frequency + 1)
Binary -> 1 if the term is present in the document, otherwise 0
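A short sketch of the three local-weight options applied to an invented vector of raw frequencies; the base of the logarithm is an assumption here (base 2 is common):

```python
import numpy as np

# Raw frequencies of four terms in one document (invented numbers).
freq = np.array([0, 1, 3, 10])

log_w = np.log2(freq + 1)            # Log: dampens very frequent terms
binary_w = (freq > 0).astype(int)    # Binary: 1 if present, else 0
none_w = freq                        # None: raw frequency, unchanged

print(log_w)      # [0.    1.    2.    3.459]
print(binary_w)   # [0 1 1 1]
```

Note how the log transform compresses the gap between a term appearing 3 times and 10 times, reflecting the idea that the tenth occurrence adds little new information.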
Global weight
Global weight (term weight) is assigned to a term based on its overall frequency and document frequency within the corpus
In SAS Text Miner, three common techniques are used to determine the term weight of a term (word):
Entropy
Entropy is a concept related to information gain
In text mining, entropy indicates how informative a term is in the corpus
Terms that have high global weights are those that appear with high frequency within a document but in only a few documents
Terms that have low global weights are those that appear with high frequency within a document and also in many documents
Typically suited to small documents
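Below is a sketch of one common formulation of the entropy global weight (the exact formula SAS Text Miner uses may differ in detail): G_i = 1 + Σ_j (p_ij log2 p_ij) / log2(n), where p_ij is the term's frequency in document j divided by its total frequency in the corpus:

```python
import numpy as np

# Term-by-document frequencies (rows = terms, columns = documents, invented).
tbdm = np.array([
    [5.0, 0.0, 0.0, 0.0],   # concentrated in one document
    [2.0, 2.0, 2.0, 2.0],   # spread evenly across all documents
])

n_docs = tbdm.shape[1]
p = tbdm / tbdm.sum(axis=1, keepdims=True)   # p_ij = f_ij / global frequency

# p * log2(p), with the 0 * log(0) cases defined as 0
plogp = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
entropy_weight = 1 + plogp.sum(axis=1) / np.log2(n_docs)

print(entropy_weight)   # [1. 0.]: concentrated term scores high, evenly spread term scores 0
```

This matches the observations above: a term that is frequent but confined to few documents gets a high weight, while a term spread across many documents gets a low one.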
Inverse Document Frequency (IDF)
IDF calculates importance as the inverse of the frequency of occurrence of a term in the corpus of documents
A term that appears infrequently is considered more important and is given a higher score (weight), whereas a term with a high frequency of appearance is considered less important and is given a lower score
An IDF weight value of zero indicates that the word appears in all documents and has an insignificant effect on discriminating the documents
Typically suited to huge documents
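A sketch of the plain IDF calculation on an invented matrix; note how the term that appears in every document gets a weight of exactly zero, as described above. Combining a local weight with this global weight per the earlier formula (local × global) yields the familiar TF-IDF weighted frequency:

```python
import numpy as np

tbdm = np.array([
    [3, 0, 0, 0],   # rare term: appears in 1 of 4 documents
    [1, 2, 1, 4],   # common term: appears in all 4 documents
])

n_docs = tbdm.shape[1]
df = (tbdm > 0).sum(axis=1)      # document frequency of each term

idf = np.log2(n_docs / df)       # 0 when the term appears in every document
print(idf)                       # [2. 0.]: the rare term gets the higher weight
```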
Mutual Information
Mutual information can be derived as the difference between the entropy of variable X and the conditional entropy of X given Y: MI(X; Y) = H(X) - H(X | Y)
In text mining, this can be thought of as a measure of how much the presence of one term in a document tells us about whether the document belongs to a particular category
Used when there is a target variable, e.g., for classification
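A sketch of mutual information between a term's presence and a binary target category, computed directly from the entropy-difference definition above (the per-document data is invented):

```python
import numpy as np

# For each of 8 documents: does the term appear, and what is the document's class?
term_present = np.array([1, 1, 1, 0, 0, 0, 1, 0])
category     = np.array([1, 1, 1, 0, 0, 0, 0, 1])

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# MI = H(category) - H(category | term presence)
h_cond = sum(
    (term_present == v).mean() * entropy(category[term_present == v])
    for v in np.unique(term_present)
)
mi = entropy(category) - h_cond
print(mi)   # ~0.19 bits: knowing the term narrows down the category somewhat
```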
Filter Node
Interactive Filter Viewer
A feature of the Text Filter node used to refine the results of parsing and filtering
Using the Interactive Filter Viewer, the user can browse all of the parsed terms and manually modify the list by dropping or keeping terms
Concept Links
Concept links help in understanding the relationships between words based on their co-occurrence in the documents
The first number represents the number of documents in which the two terms co-occur, and the second number represents the total number of documents in which the specific term occurs
From the concept links, it can be seen that "image quality" has a strong relationship with "image"
It can also be seen that "image" has a strong relationship with "color"; one could infer that color enhances the image, and hence that "image" is highly associated with "image quality"
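A sketch of how the two numbers on a concept link can be derived by counting co-occurrences across documents (the documents, already reduced to sets of parsed terms, are invented):

```python
from collections import Counter
from itertools import combinations

docs = [
    {"image", "quality", "color"},
    {"image", "quality"},
    {"image", "color", "camera"},
    {"battery", "camera"},
]

term_df = Counter()   # number of documents containing each term
pair_df = Counter()   # number of documents in which each pair of terms co-occurs
for terms in docs:
    term_df.update(terms)
    pair_df.update(combinations(sorted(terms), 2))

# For the link between "image" and "quality":
print(pair_df[("image", "quality")], term_df["image"])   # 2 co-occurrences, "image" in 3 documents
```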
Singular Value Decomposition (SVD)
The use of Singular Value Decomposition (SVD) is known as the linear algebra approach to Text Mining, Information Retrieval, Web Analytics, etc.
This algebraic operation is the foundation of an approach that goes by many names, such as:
Latent Semantic Indexing (LSI)
Latent Semantic Analysis (LSA)
Vector Space Model (VSM)
All these approaches are NLP (Natural Language Processing) techniques that use SVD mathematics to determine or identify relationships between terms, concepts, or topics
The underlying assumption is that terms close in meaning will frequently occur in the same sets of documents
SVD is used to measure correlation between objects
SVD values are used as inputs to predictive models in text classification
SVD Example:
Dimension Reduction using SVD
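As a sketch of the idea, the truncated SVD below projects each document of a tiny invented TBDM onto k = 2 dimensions; the two blocks of documents separate cleanly along the two retained singular vectors:

```python
import numpy as np

# A small term-by-document matrix (rows = terms, columns = documents, invented).
tbdm = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 1.0],
])

U, s, Vt = np.linalg.svd(tbdm, full_matrices=False)

k = 2                                        # number of dimensions to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a k-dimensional vector

print(doc_vectors)   # documents 1-2 load on one dimension, documents 3-4 on the other
```

These k-dimensional document vectors are the SVD values that feed clustering and predictive models.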
SVD Resolution value:
High = 100%
High resolution means keeping more SVD dimensions, i.e., more granular
Medium = 5/6 = 83.3%
Low = 2/3 = 66.67% (the default)
Low resolution means retaining fewer SVD dimensions, i.e., more summarized
The maximum number of SVD dimensions can be any value between 2 and 500; a higher number generates a better data summary but takes more computing power
Value of ‘K’ in SVD computation
SVD computation is memory intensive
For large datasets, the SVD computation may use random sampling instead of the full dataset to avoid running out of memory
Value of K (Max SVD Dimensions)
A high value of K gives a better approximation of the actual matrix
Too high a value of K results in a large number of dimensions, which is computationally intensive for modelling and may introduce noise
Generally, values of 20 to 200 are appropriate as the first K singular values to be calculated
K values of 2 to 50 are appropriate for clustering
K values of 30 to 200 are appropriate for prediction or classification
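One practical way to pick K is to look at how much of the matrix's variation the first K singular values capture; the sketch below (on invented random data) finds the smallest K reaching 80%:

```python
import numpy as np

# Invented sparse-ish term-by-document matrix; real TBDMs are far larger.
rng = np.random.default_rng(0)
tbdm = rng.poisson(0.3, size=(200, 50)).astype(float)

s = np.linalg.svd(tbdm, compute_uv=False)        # singular values only
explained = np.cumsum(s**2) / np.sum(s**2)       # cumulative proportion captured

k = int(np.searchsorted(explained, 0.80)) + 1    # smallest K capturing >= 80%
print(k, round(float(explained[k - 1]), 3))
```

The 80% threshold is an illustrative choice, not a SAS default; the 2 to 50 and 30 to 200 guidelines above remain the usual starting points.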