- Text Topic
- A topic is a subject, theme, or idea that occurs in a document
- A document can contain more than one topic (in clustering, by contrast, a document belongs to only one cluster)
- Topics are generated in two ways:
- Automatically, by the Text Topic node, using the same underlying mathematical algorithm that Text Cluster uses
- Manually, through user definitions
- Concept Link vs Text Topic
- Concept Links
- Help explain the relationships between words based on how often the words co-occur in the documents
- Text Topic Node
- The Text Topic node in SAS Text Miner discovers topics from text
- The node analyzes document contents and summarizes the collection by identifying topics
- In a text topic analysis, each document can belong to many topics
- The Text Topic node enables an analyst to create topics of interest using groups of terms identified in text parsing
- The Text Topic node must be preceded by either the Text Parsing or Text Filter node
- When the Text Topic node is connected to a Text Filter node, the term weighting properties set in the Text Filter node are used by the Text Topic node
- The node finds terms that frequently occur together within documents
- The node can be configured to identify single-term topics or multi-term topics in the data
- Topic extraction is computationally intensive because the node uses rotated SVD in the background to capture information from a sparse term-by-document matrix
- Example:
- There are four documents
- Plotting the terms in the SVD space shows that the terms “iPad”, “kids”, and “love” are close to one another
- Terms that occur only once in the text, such as “great” and “OSU”, are far from the other terms
- The terms “iPad”, “kids”, and “love” can be combined to form a topic
- Similarly, the terms “love”, “is”, and “I” can be combined to form another topic
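The idea behind the example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four documents below are invented (the slides do not list the original four), and plain cosine similarity between rows of the term-by-document matrix stands in for closeness in SVD space. It is not the rotated SVD that the Text Topic node runs.

```python
import math

# Hypothetical four-document collection (NOT the documents from the slides).
docs = [
    "kids love iPad",
    "my kids love iPad",
    "I love OSU",
    "iPad is great",
]

def term_by_document(documents):
    """Return {term: [count in doc 0, count in doc 1, ...]}."""
    matrix = {}
    for j, text in enumerate(documents):
        for token in text.lower().split():
            row = matrix.setdefault(token, [0] * len(documents))
            row[j] += 1
    return matrix

def cosine(u, v):
    """Cosine similarity between two term rows (0.0 if either row is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

matrix = term_by_document(docs)

# Terms that share documents score high; single-occurrence terms that never
# appear alongside the other term score zero.
for a, b in [("kids", "ipad"), ("kids", "love"), ("kids", "great"), ("osu", "great")]:
    print(f"{a:>5} ~ {b:<5} {cosine(matrix[a], matrix[b]):.2f}")
```

In this toy collection, “kids”, “love”, and “ipad” share documents and so score as similar, while “great” and “osu” each appear in a single document and score zero against terms from other documents. SVD goes a step further by compressing these co-occurrence patterns into a small number of dimensions.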
- Text Topic Properties
- By default, the node does not generate any single-term topics
- The user can change this with the "Number of Single-term Topics" setting in the Properties panel
- By default, the node creates 25 multi-term topics, where each topic is essentially an SVD dimension
- The user can change this number with the "Number of Multi-term Topics" property
- Term topic weight:
- Each term is assigned a weight corresponding to each topic
- If there are 25 topics extracted, there will be 25 term topic weights calculated for a single term
- Document topic weight:
- Every document in the collection is assigned a weight corresponding to each topic
- If there are 25 topics extracted, there will be 25 document topic weights calculated for a single document
- Term topic weights and document topic weights are used to calculate cutoff scores for each multi-term topic
- Term cutoff:
- This is the threshold score that determines whether a term belongs to a topic
- Any term with an absolute term topic weight greater than this cutoff is assigned to this topic
- Document cutoff:
- This is the threshold score that determines whether a document belongs to a topic
- Any document with an absolute document topic weight greater than this cutoff is assigned to this topic
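The cutoff rule described above can be sketched as a simple filter. The weights and cutoff values below are invented for illustration, and `assign` is a hypothetical helper, not part of SAS Text Miner. The only thing taken from the notes is the rule itself: an item belongs to a topic when its absolute weight is greater than the cutoff.

```python
# Hypothetical weights for ONE topic; all numbers are invented for illustration.
term_topic_weights = {
    "ipad": 0.62, "kids": 0.48, "love": -0.35, "great": 0.04, "osu": -0.02,
}
document_topic_weights = {
    "doc1": 0.71, "doc2": 0.55, "doc3": -0.08, "doc4": 0.12,
}

def assign(weights, cutoff):
    """Return the names whose absolute weight is greater than the cutoff."""
    return sorted(name for name, w in weights.items() if abs(w) > cutoff)

term_cutoff = 0.30       # assumed value, not a SAS default
document_cutoff = 0.40   # assumed value, not a SAS default

topic_terms = assign(term_topic_weights, term_cutoff)
topic_docs = assign(document_topic_weights, document_cutoff)

print("terms in topic:    ", topic_terms)
print("documents in topic:", topic_docs)
```

Because the comparison uses the absolute value, a term with a strongly negative weight (here “love” at -0.35) is still assigned to the topic. Repeating the same filter for each topic is what allows one document to belong to several topics.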
- Select the Text Topic node
- Three windows are displayed:
- The top window, the Topic window, lists the topics (each row is a topic described by a string of terms) along with each topic's "Term Cut-off" and "Document Cut-off"
- The middle window, the Term window, displays the terms in the topic selected in the Topic window
- The bottom window, the Document window, lists the documents classified under the topic selected in the Topic window
- Create data sources from the SAS shared data library
- User Topics
- On the customized Text Topic node, select the “User Topics” property to define user topics