RyanGoh's Eportfolio

1. Collect Data

Collection of Data from:
- Documents (doc, pdf etc.)
- Web pages (Textual files, XMLs)
- User comments

Text Import

The Text Import node converts a collection of documents (for example, a corpus of PDFs) into a single SAS table
A corpus is a pool of documents used for analysis
Each document then becomes a row in the SAS data set
Supported document types:
- Microsoft Word
- Microsoft Excel
- Microsoft PowerPoint
- Adobe Acrobat (PDF)
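
The "one document becomes one row" idea can be sketched outside SAS. The snippet below is only an illustration in Python, not how the Text Import node is implemented: it reads plain .txt files only, and the function name `corpus_to_rows` and the column names are hypothetical.

```python
from pathlib import Path


def corpus_to_rows(folder):
    """Return one row (a dict) per .txt file found under `folder`."""
    rows = []
    # Sort the paths so the row order is stable between runs.
    for path in sorted(Path(folder).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({
            "filename": path.name,   # hypothetical column name
            "size": len(text),       # number of characters read
            "text": text,            # the document body
        })
    return rows
```

Calling `corpus_to_rows("some_folder")` on a folder of text files returns a list whose length equals the number of documents, which mirrors the row count of the resulting table.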

To create a Text Import node for text documents

Right-click the Text Import node:
- To run the Text Import node on the source files
- To view the results

In this run, there were no omitted or truncated files.
- Files are omitted when they are in an unsupported format
- Files are truncated when they exceed the defined maximum size
The Output window indicates that 3,038 documents were processed
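
The omit and truncate rules above can be pictured with a small Python sketch. It is an assumption-laden illustration rather than the node's actual logic: only .txt is treated as a supported format, the size limit is counted in characters, and `import_report`, `SUPPORTED` and `max_chars` are hypothetical names.

```python
from pathlib import Path

SUPPORTED = {".txt"}  # assumption: this sketch accepts plain text only


def import_report(folder, max_chars=100):
    """Split the files in `folder` into processed, omitted and truncated."""
    processed, omitted, truncated = [], [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # Wrong format: the file is skipped entirely.
        if path.suffix.lower() not in SUPPORTED:
            omitted.append(path.name)
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        # Too large: keep the first max_chars characters and record the file.
        if len(text) > max_chars:
            text = text[:max_chars]
            truncated.append(path.name)
        processed.append({"filename": path.name, "text": text})
    return processed, omitted, truncated
```

An empty `omitted` list and an empty `truncated` list correspond to the "no omitted or truncated files" message reported for this run.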

To view the data files that are loaded by Text Import

The steps above show how 3,038 documents in .txt format are read and loaded into a SAS table.

Web Crawler

A web crawler is a software program that traverses pages on the Internet by following the embedded links through which web pages are interconnected
Web crawlers require a web page link to begin crawling (also known as a seed page or entry point)
Two modes of crawling:
- Breadth-first mode: the crawler visits every page linked from the current level before following links one level deeper
- Depth-first mode: the crawler follows one chain of links as far as it goes before backtracking to the next link
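
The difference between the two modes is easiest to see in the visit order. The Python sketch below walks a small in-memory link graph instead of the real Web; the function `crawl`, the `links` dictionary and the page names are all hypothetical, and no network access is involved.

```python
from collections import deque


def crawl(links, seed, mode="breadth"):
    """Return pages in the order visited, starting from the seed page.

    `links` maps each page to the list of pages it links to.
    """
    frontier = deque([seed])
    visited = []
    seen = set()
    while frontier:
        # Breadth first takes the oldest page (queue);
        # depth first takes the newest page (stack).
        page = frontier.popleft() if mode == "breadth" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        visited.append(page)
        out = links.get(page, [])
        if mode == "depth":
            # Reverse so the first link on the page is followed first.
            out = list(reversed(out))
        frontier.extend(out)
    return visited


links = {"seed": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
print(crawl(links, "seed", "breadth"))  # ['seed', 'A', 'B', 'C', 'D', 'E']
print(crawl(links, "seed", "depth"))    # ['seed', 'A', 'C', 'D', 'B', 'E']
```

In breadth-first order both pages linked from the seed are visited before any of their children, while in depth-first order everything under A is exhausted before B is reached.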

Import File Directory
- Files downloaded from the Web are saved to this folder
Destination Directory
- The Text Import node processes the files in the Import File Directory and places the processed files in this folder

The text from the website

File Import

Each row of feedback is taken as a row (or document) in the SAS table

For the rows in the file to be extracted as individual rows (or files) in the SAS Text Miner environment, the File Import node is used

To create a File Import node for a text document
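
As a rough analogy for what File Import does with a feedback file, the Python sketch below turns each row of a delimited file into its own document. It assumes a CSV layout with a header row; the function `rows_to_documents` and the column name `feedback` are hypothetical and do not come from SAS.

```python
import csv
import io


def rows_to_documents(handle, text_column="feedback"):
    """Turn each data row of a CSV file into one document record."""
    reader = csv.DictReader(handle)
    return [
        {"doc_id": i, "text": row[text_column]}
        for i, row in enumerate(reader, start=1)
    ]


# A tiny in-memory file stands in for the real feedback file.
sample = io.StringIO('id,feedback\n1,Great service\n2,"Slow, but friendly"\n')
for doc in rows_to_documents(sample):
    print(doc["doc_id"], doc["text"])
```

Using a CSV parser rather than splitting on commas keeps a quoted comment that contains a comma together as one document.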

The results from File Import
