2. Text Mining and Analytics

Data Mining (DM)
- Refers to the extraction or “mining” of knowledge from large amounts of data
- The knowledge comes from discovering patterns in the data that are useful for decision making and predicting the future
- 2 Broad Approaches
  1. Unsupervised Learning (Exploratory Analysis / pattern discovery)
    - There is no target variable
    - Some form of analysis is performed to do the following:
      - Identify or define homogeneous groups, clusters, or segments
      - Find links or associations between entities, as in market basket analysis
  2. Supervised Learning (Predictive Analysis / To do prediction)
    - A target variable is used
    - Some form of predictive or classification model is developed
    - Input variables are associated with values of a target variable, and the model produces a predicted target value for a given set of inputs

Unsupervised Learning
1. Clustering
  - Natural grouping of a set of objects
  - Objects in the same group (cluster) are more similar to each other than to those in other clusters
2. Association Analysis
  - Pattern where one event (occurrence) is connected to another event
3. Path or Link Analysis
  - Link Analysis deals with mining useful information from linked structures like graphs
  - Graphs have vertices representing objects and links among those vertices representing relationships among those objects
4. Topic or Concept Analysis
  - The discovery of key ideas or topics in a corpus of documents
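
To make association analysis more concrete, the sketch below counts how often two items occur together (support) and how often one item appears given another (confidence). The baskets and the function names `pair_support` and `confidence` are hypothetical, invented for illustration; this is a minimal sketch, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets containing `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "milk")])            # share of baskets with both items
print(confidence(baskets, "bread", "milk"))  # share of bread baskets that also have milk
```

A pair with high support and high confidence is a candidate association, such as "customers who buy bread often also buy milk".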

Supervised Learning
1. Classification
  - It is a type of predictive analysis to identify new patterns or events and categorize them in a pre-defined classification
  - The target variable is categorical
2. Regression
  - It is a type of predictive analysis that estimates a numerical value for new patterns or events
  - The target variable is numerical
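
The difference between the two target types can be sketched with the same input variable and two different targets. The data, the least-squares line, and the nearest-neighbour rule below are illustrative assumptions, chosen only because they are short; any other regression or classification model would serve equally well.

```python
# Hypothetical training data, invented for illustration only.
hours = [1.0, 2.0, 3.0, 4.0]                 # input variable
scores = [52.0, 54.0, 56.0, 58.0]            # numerical target -> regression
results = ["fail", "fail", "pass", "pass"]   # categorical target -> classification

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def nearest_neighbour_label(xs, labels, new_x):
    """Classify new_x with the label of the closest training input."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - new_x))
    return labels[closest]

slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5.0 + intercept                        # regression output: a number
predicted_class = nearest_neighbour_label(hours, results, 3.6)   # classification output: a category

print(predicted_score, predicted_class)
```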

Types of data
1. Structured
  - Refers to data that is identifiable because it is organized in a structured format
  - Structured data are normally numerical and presented in spreadsheet format (rows and columns)
2. Unstructured
  - Refers to data that has no identifiable structure
  - Examples include images, videos, email, documents and text
  - Unstructured data usually needs to be transformed into some form of structured numerical representation before normal data mining techniques can be applied
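
As one possible illustration of such a transformation, the sketch below turns two short texts into rows of word counts (a "bag of words"), which gives the rows-and-columns layout of structured data. The documents and the helper names `tokenize` and `to_vector` are hypothetical; word counts are only one of several possible numerical representations.

```python
import re
from collections import Counter

# Hypothetical documents, invented for illustration only.
documents = [
    "The car is fast.",
    "The car is red, the truck is blue.",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# The vocabulary fixes one column per distinct word across all documents.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

def to_vector(text):
    """Return the word counts of `text`, one entry per vocabulary column."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

rows = [to_vector(doc) for doc in documents]

print(vocabulary)
for row in rows:
    print(row)
```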

Text Mining
- Is used to denote any system that analyses large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract useful information
- Is the process or task of transforming unstructured text data into structured numerical data so that automatic algorithms can be applied to large document databases
- Is also the process that involves text data search/collection, data cleansing, and transformation of text into a suitable (structured) form ready for analysis

Text Mining Applications
1. Information retrieval
  - Finding documents with relevant content of interest
2. Document categorization
  - Clustering documents into naturally occurring groups
  - Extracting themes or concepts
3. Anomaly detection
  - Identifying unusual documents that might be associated with cases requiring special handling
4. Sentiment Analysis
  - Finding the sentiment polarity (E.g. positive or negative feelings)
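
Sentiment polarity can be sketched with a simple word-list approach: count positive and negative words and compare. The two word lists, the `polarity` function, and the extra "neutral" outcome are assumptions made for this illustration; practical sentiment analysis normally relies on much larger lexicons or trained models.

```python
import re

# Tiny hypothetical lexicons, invented for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def polarity(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no sentiment words, or an equal number of each

print(polarity("I love this great phone"))
print(polarity("Terrible battery and poor screen"))
```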

Text Mining Practice Areas
1. Information Retrieval (IR)
  - Study of searching and retrieving a subset of documents from a universe of document collections in response to a search query
2. Document Classification
  - The process of finding commonalities in the documents in a corpus and grouping them into predetermined labels (supervised learning) based on the topical themes
  - The task is to assign a document to one or more classes or categories based on some rules, e.g. classification rules
  - Supervised learning technique uses predetermined labels or categories
3. Information Extraction
  - The process of extracting fragments of data such as the names of people, organizations, places, addresses, dates, times, etc., from documents
4. Document clustering
  - Clustering partitions objects (e.g. terms or words) in a data set into groups so that the objects within a group (cluster) are similar and the objects in different groups are dissimilar
5. Web mining
  - The use of data mining techniques to discover and extract information from the World Wide Web (WWW) or the internet
6. Natural Language Processing
  - The ability of a computer program to understand human speech as it is spoken or text as it is written
  - 3 major aspects of Natural language processing:
    1. Syntax
      - Describes the form of the language, i.e. the grammar.
    2. Semantics
      - The study of the meaning and interpretation of words and sentences in a language.
    3. Pragmatics
      - Explains how the sentence relates to the world.
      - Takes into account the context of the sentence, the state of the world, and the goals of the speaker and the listener
7. Concept Extraction
  - The technique of mining the most important topic or concept of a document
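
As a small illustration of the Information Extraction practice area above, the sketch below pulls dates and times out of free text with regular expressions. The patterns assume ISO-style dates (YYYY-MM-DD) and HH:MM times, and the sample sentence is invented; extracting names of people, organizations, or places generally needs NLP techniques beyond simple patterns.

```python
import re

# Assumed formats for this sketch: ISO dates (YYYY-MM-DD) and HH:MM times.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\b")

def extract_fields(text):
    """Return the date and time fragments found in a document."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "times": TIME_PATTERN.findall(text),
    }

# Hypothetical document text, invented for illustration only.
sample = "The meeting moved from 2023-05-01 at 09:30 to 2023-05-03 at 14:00."
fields = extract_fields(sample)
print(fields)
```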

Text Analytics Process
1. Collect Data
  - Relevant raw data are collected based on the business objectives
2. Text Parsing
  1. Extract words
    - Extract words from the corpus and cleanse them (E.g. remove extra white space, markup such as <html>, and unknown symbols such as “#”)
    - A corpus refers to a collection or a set of documents
  2. Parts of Speech
    - Determine Parts of Speech (POS) for each word using NLP
    - E.g. determine whether the word is an adjective, noun, or verb
  3. Stemming
    - Reduce words to their root form. E.g. “run” and “ran” are both normalized to “run”
  4. Word filtering
    - Remove words with little information. E.g. “the”, “you”, “run”
  5. Synonyms
    - Normalize words with the same meaning. E.g. “vehicle” and “car” are normalized to “car”
3. Text Filtering
  - Filter irrelevant terms
    - Further reduction of terms, and customization of terms for a specific business domain, may be needed.
  - Create custom start/stop lists
4. Transformation
  - Calculate term counts
  - Create Term-By-Document Matrix
    - Contains key words and their weights, which represent the information and concepts of the corpus of text documents under analysis.
  - Calculate SVDs (singular value decompositions)
5. Text Mining (This process is iterative)
  - Topic Extraction
  - Clustering
  - Link Analysis
  - Predictive Analytics
  - Boolean Rules
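
The parsing and transformation steps above can be sketched end to end on a toy corpus. Everything in this example is an assumption made for illustration: the corpus, the stop list, and the lookup tables `STEMS` and `SYNONYMS`, which stand in for a real stemmer and thesaurus. Part-of-speech tagging, term weighting, and the SVD step are left out, so the matrix holds raw term counts only.

```python
import re
from collections import Counter

# Hypothetical word lists and corpus, invented for illustration only.
STOP_WORDS = {"the", "a", "you", "is", "my"}
STEMS = {"ran": "run", "running": "run", "runs": "run", "vehicles": "vehicle"}
SYNONYMS = {"vehicle": "car"}

corpus = [
    "<html>The car ran fast #1</html>",
    "My vehicle is running",
    "You ran, the vehicles ran",
]

def parse(document):
    """Cleanse, tokenize, stem, filter, and map synonyms for one document."""
    text = re.sub(r"<[^>]+>", " ", document)              # drop markup such as <html>
    tokens = re.findall(r"[a-z]+", text.lower())          # keep alphabetic words only
    tokens = [STEMS.get(t, t) for t in tokens]            # stemming via lookup table
    tokens = [t for t in tokens if t not in STOP_WORDS]   # word filtering (stop list)
    return [SYNONYMS.get(t, t) for t in tokens]           # synonym normalization

# Term counts per document.
parsed = [Counter(parse(doc)) for doc in corpus]

# Term-by-document matrix: one row per term, one column per document.
terms = sorted(set().union(*parsed))
matrix = [[counts[term] for counts in parsed] for term in terms]

for term, row in zip(terms, matrix):
    print(term, row)
```

Each row of the resulting matrix describes how one term is distributed across the documents, and it is this numerical table that the later text mining steps operate on.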
