Some form of analysis is performed to do the following:
Identify or define homogeneous groups, clusters, or segments
Find links or associations between entities, as in market basket analysis
Supervised Learning (Predictive Analysis / To do prediction)
A target variable is used
Some form of predictive or classification model is developed
Input variables are associated with values of a target variable, and the model produces a predicted target value for a given set of inputs
Unsupervised Learning
Clustering
Natural grouping of a set of objects
Objects in the same group (cluster) are more similar to each other than to those in other clusters
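As a minimal sketch of clustering, the code below groups one-dimensional values with a tiny k-means loop. k-means is only one of many clustering algorithms and is chosen here for illustration; the function name kmeans_1d, the sample ages, and the starting centers are all assumptions, not part of these notes.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then move the centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical data: customer ages, with two assumed starting centers.
ages = [21, 23, 22, 45, 47, 50]
centers, clusters = kmeans_1d(ages, centers=[20, 60])
print(clusters)
```

No target variable is supplied: the groups emerge only from how close the values are to each other.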
Association Analysis
Pattern where one event (occurrence) is connected to another event
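A rough illustration of association analysis on market baskets follows. Support and confidence are the usual measures for such rules, although these notes do not define them; the baskets and the helper names support and confidence are made up for the example.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated share of baskets with the antecedent that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, round(rule_confidence, 2))
```

Here the rule "bread implies milk" holds in half of all baskets and in two thirds of the baskets that contain bread.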
Path or Link Analysis
Link Analysis deals with mining useful information from linked structures like graphs
Graphs have vertices representing objects and links among those vertices representing relationships among those objects
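The graph structure described above can be sketched with a plain adjacency map. The people and links are invented, and counting neighbours (degree) is just one simple way to mine such a structure, assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical links: each pair means "these two people exchanged email".
links = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

# Vertices are people; each vertex maps to the set of vertices it is linked to.
graph = defaultdict(set)
for a, b in links:
    graph[a].add(b)  # treat every link as undirected
    graph[b].add(a)

# Degree = number of direct relationships a vertex takes part in.
degree = {vertex: len(neighbours) for vertex, neighbours in graph.items()}
most_connected = max(degree, key=degree.get)
print(degree, most_connected)
```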
Topic or Concept Analysis
The discovery of key ideas or topics in a corpus of documents
Supervised Learning
Classification
It is a type of predictive analysis that assigns new patterns or events to one of a set of pre-defined categories
The target variable is categorical
Regression
It is a type of predictive analysis that estimates a numerical value for new patterns or events
The target variable is numerical
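To make the contrast concrete, here is a small sketch with one categorical target and one numerical target. The nearest-neighbour rule and the least-squares line are stand-ins picked for brevity, not methods prescribed by these notes, and all data and function names are hypothetical.

```python
def predict_class(x, examples):
    """1-nearest-neighbour: return the label of the closest training input."""
    nearest_input, label = min(examples, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Classification: the target is categorical ("low" or "high").
labelled = [(20, "low"), (25, "low"), (60, "high"), (70, "high")]
predicted_label = predict_class(65, labelled)

# Regression: the target is numerical.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_value = slope * 5 + intercept
print(predicted_label, predicted_value)
```

In both cases the model is built from inputs paired with known target values, then applied to a new input.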
Types of data
Structured
Refers to data that is identifiable because it is organized in a structured format
Structured data are normally numerical and presented in spreadsheet format (rows and columns)
Unstructured
Refers to data that has no identifiable structure
Examples include images, videos, email, documents and text
Unstructured data usually needs to be transformed into some form of structured numerical representation before normal data mining techniques can be applied
Text Mining
Is used to denote any system that analyses large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract useful information
Is the process or task of transforming unstructured text data into structured numerical data so that automatic algorithms can be applied to large document databases
Is also the process that involves text data search/collection, data cleansing, and transformation of text into a suitable (structured) form ready for analysis
Text Mining Applications
Information retrieval
Finding documents with relevant content of interest
Document categorization
Clustering documents into naturally occurring groups
Extracting themes or concepts
Anomaly detection
Identifying unusual documents that might be associated with cases requiring special handling
Sentiment Analysis
Finding the sentiment polarity (e.g. positive or negative feelings)
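As one possible illustration of sentiment polarity, the sketch below counts words from two small hand-made lists. The word lists and the function name polarity are assumptions; practical sentiment analysis normally relies on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def polarity(text):
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("I love this phone, great battery"))
print(polarity("Terrible service and poor food"))
print(polarity("It arrived on Monday"))
```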
Text Mining Practice Areas
Information Retrieval (IR)
Study of searching and retrieving a subset of documents from a universe of document collections in response to a search query
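A very simplified view of retrieval: score each document by how many of its words appear in the query, then return the matching subset in score order. The documents, the scoring rule, and the names tokens and search are assumptions for illustration only.

```python
import re

# Hypothetical document collection keyed by document id.
docs = {
    "d1": "Text mining extracts patterns from text",
    "d2": "Market basket analysis finds associations",
    "d3": "Mining text and web data",
}


def tokens(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def search(query, docs):
    """Return ids of documents sharing words with the query, best match first."""
    query_terms = set(tokens(query))
    scored = []
    for doc_id, text in docs.items():
        score = sum(1 for t in tokens(text) if t in query_terms)
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


results = search("text mining", docs)
print(results)
```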
Document Classification
The process of finding commonalities in the documents in a corpus and grouping them into predetermined labels (supervised learning) based on the topical themes
The task is to assign a document to one or more classes or categories based on some rules, e.g. classification rules
Supervised learning technique uses predetermined labels or categories
Information Extraction
The process of extracting fragments of data such as the names of people, organizations, places, addresses, dates, times, etc., from documents
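For fragments with a regular shape, such as dates and times, extraction can be sketched with regular expressions. The sample sentence and the assumed formats (YYYY-MM-DD and HH:MM) are illustrative; names of people, organizations, and places generally need NLP techniques rather than fixed patterns.

```python
import re

# Hypothetical input text.
text = "Meeting with Acme on 2024-03-15 at 14:30; follow-up on 2024-04-01."

# Assumed formats: ISO-style dates and 24-hour times.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
times = re.findall(r"\b\d{2}:\d{2}\b", text)
print(dates, times)
```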
Document clustering
Clustering partitions the objects in a data set (e.g. terms or words) into groups so that the objects within a group (cluster) are similar and the objects in different groups are dissimilar
Web mining
The use of data mining techniques to discover and extract information from the World Wide Web (WWW) or the internet
Natural Language Processing
The ability of a computer program to understand human speech as it is spoken or text as it is written
3 major aspects of Natural language processing:
Syntax
Describes the form of the language, i.e. the grammar.
Semantics
The study of the meaning and interpretation of words and sentences in a language.
Pragmatics
Explains how the sentence relates to the world.
Takes into account the context of the sentence, the state of the world, and the goals of the speaker and the listener
Concept Extraction
The technique of mining the most important topic or concept of a document
Text Analytics Process
Collect Data
Relevant raw data are collected based on the business objectives
Text Parsing
Extract words
Extract words from the corpus and cleanse them (e.g. remove extra white space, markup such as <html>, and unknown symbols such as “#”)
A corpus refers to a collection or a set of documents
Parts of Speech
Determine Parts of Speech (POS) for each word using NLP
E.g. determine whether the word is an adjective, noun, or verb
Stemming
Reduce words to a common root form. E.g. “run” and “ran” are both normalized to “run”
Word filtering
Remove words that carry little information. E.g. “the”, “you”
Synonyms
Normalize words with the same meaning. E.g. “vehicle” and “car” are normalized to “car”
Text Filtering
Filter irrelevant terms
Further reduction of terms, and customization of terms for a specific business domain, may be needed.
Create custom start/stop lists
Transformation
Calculate term counts
Create Term-By-Document Matrix
Contains key words and their weights, which together represent the gist of the information and concepts in the corpus of text documents under analysis.
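The parsing, filtering, and transformation steps above can be sketched end to end as follows. This is a toy pipeline under stated assumptions: the stop list, the synonym map, the crude suffix-stripping stem function, and the three-document corpus are all hypothetical, part-of-speech tagging is omitted, and raw term counts stand in for the weights.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "you", "and"}  # hypothetical stop list
SYNONYMS = {"vehicle": "car"}                  # hypothetical synonym map


def stem(word):
    """Very crude suffix stripping; real stemmers handle far more cases."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(document):
    """Cleanse, extract words, filter stop words, stem, and map synonyms."""
    text = re.sub(r"<[^>]+>", " ", document)      # strip HTML-like tags
    words = re.findall(r"[a-z]+", text.lower())   # drops symbols such as "#"
    words = [w for w in words if w not in STOP_WORDS]
    words = [stem(w) for w in words]
    return [SYNONYMS.get(w, w) for w in words]


# A corpus is a collection of documents.
corpus = [
    "<p>The car is parking</p>",
    "You park the vehicle #now",
    "Cars and vehicles",
]

# Term counts per document, then one matrix row per term, one column per document.
counts = [Counter(parse(doc)) for doc in corpus]
terms = sorted(set().union(*counts))
matrix = [[c[t] for c in counts] for t in terms]

for term, row in zip(terms, matrix):
    print(term, row)
```

After synonym mapping and stemming, "car", "cars", "vehicle", and "vehicles" all collapse into the single term "car", which is why the matrix ends up with only three rows.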