The Text Import node converts a collection of documents (for example, a corpus of PDFs) into a single SAS table
A corpus is a pool of documents used for analysis
Each document then becomes a row in the SAS data set
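The one-document-per-row idea can be sketched in Python. This is only an illustration, not how SAS Text Miner works internally: the function name import_corpus, the column names filename and text, and the two demo files are all assumptions made for the example.

```python
from pathlib import Path
import tempfile


def import_corpus(folder):
    """Return one row (a dict) per file in `folder`, sorted by file name.

    Hypothetical helper: it mimics the idea of turning a corpus into a
    table, with one row per document.
    """
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            rows.append({
                "filename": path.name,
                "text": path.read_text(encoding="utf-8"),
            })
    return rows


# Build a tiny two-document corpus in a temporary folder and import it.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second document", encoding="utf-8")
    table = import_corpus(tmp)

for row in table:
    print(row["filename"], "|", row["text"])
```

Each dict in table plays the role of one row of the resulting data set.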
Supported document types:
Microsoft Word
Microsoft Excel
Microsoft PowerPoint
Adobe Acrobat (PDF)
To create a Text Import node for text documents
Right-click the Text Import node:
To run the Text Import node from the source file
To view the result
There were no omitted or truncated files.
Files are omitted when they are in an unsupported format
Files are truncated when they exceed the defined maximum size
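The two rules above (omit unsupported formats, truncate oversized files) can be sketched as follows. The extension list, the 20-character limit, and the names SUPPORTED, MAX_CHARS and import_document are assumptions chosen for illustration; the real node has its own settings.

```python
from pathlib import PurePath

# Assumed extension list and size limit, for illustration only.
SUPPORTED = {".txt", ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}
MAX_CHARS = 20


def import_document(name, text, max_chars=MAX_CHARS):
    """Return a row with `omitted` and `truncated` flags for one document."""
    suffix = PurePath(name).suffix.lower()
    if suffix not in SUPPORTED:
        # Wrong format: the file is skipped and no text is kept.
        return {"filename": name, "text": "", "omitted": True, "truncated": False}
    # Supported format: keep at most `max_chars` characters of text.
    truncated = len(text) > max_chars
    return {
        "filename": name,
        "text": text[:max_chars],
        "omitted": False,
        "truncated": truncated,
    }


rows = [
    import_document("notes.txt", "short text"),
    import_document("long.txt", "x" * 30),
    import_document("tool.exe", "binary content"),
]
omitted_count = sum(r["omitted"] for r in rows)
truncated_count = sum(r["truncated"] for r in rows)
print(omitted_count, "omitted,", truncated_count, "truncated")
```

A run with no omitted or truncated files corresponds to both counts being zero.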
The Output window indicates that 3,038 documents were processed
To view the data files that are loaded by Text Import
The above steps show how 3,038 documents in .txt file format are read and loaded into a SAS table.
Web Crawler
A web crawler is a software program that traverses pages on the Internet by leveraging the embedded link structure through which web pages are interconnected
Web crawlers require a web page link to begin crawling (also known as a seed page or entry point)
Two modes of crawling:
Traversing Path of a Web Crawler in Breadth First Mode
Traversing Path of a Web Crawler in Depth First Mode
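The difference between the two traversal paths can be sketched on a toy link graph. Nothing here touches the network: the LINKS dictionary stands in for the links found on each page, and the page names and function names are hypothetical.

```python
from collections import deque

# Toy link structure: page -> pages it links to (assumed for illustration).
LINKS = {
    "seed": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
    "C": [],
    "D": [],
    "E": [],
}


def crawl_breadth_first(seed, links):
    """Visit all pages at one link depth before going one level deeper."""
    visited, order = {seed}, []
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order


def crawl_depth_first(seed, links):
    """Follow the first link as deep as possible before backtracking."""
    visited, order = set(), []
    stack = [seed]
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is explored first.
        for nxt in reversed(links.get(page, [])):
            if nxt not in visited:
                stack.append(nxt)
    return order


print(crawl_breadth_first("seed", LINKS))  # ['seed', 'A', 'B', 'C', 'D', 'E']
print(crawl_depth_first("seed", LINKS))    # ['seed', 'A', 'C', 'D', 'B', 'E']
```

Breadth first uses a queue, so pages close to the seed page are fetched first; depth first uses a stack, so the crawler follows one chain of links to its end before returning.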
Import File Directory
The files from the Web will be downloaded to this folder
Destination Directory
The Text Import node processes the files in the Import File Directory and places the processed files in this folder
The text from the website
File Import
Each row of feedback is taken as a row (or document) in the SAS table
For the rows in the file to be extracted as individual rows (or documents) in the SAS Text Miner environment, the File Import node is used
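The row-per-document idea can be sketched in Python with a small in-memory CSV. The sample feedback text, the column name feedback, and the function rows_to_documents are assumptions for illustration and do not reflect the File Import node's actual interface.

```python
import csv
import io

# Hypothetical feedback file with one piece of feedback per row.
FEEDBACK_FILE = "id,feedback\n1,Great service\n2,Delivery was late\n"


def rows_to_documents(file_obj, text_column="feedback"):
    """Treat every data row of a CSV file as its own document."""
    reader = csv.DictReader(file_obj)
    return [
        {"doc_id": i, "text": row[text_column]}
        for i, row in enumerate(reader, start=1)
    ]


documents = rows_to_documents(io.StringIO(FEEDBACK_FILE))
for doc in documents:
    print(doc["doc_id"], doc["text"])
```

Here one input file yields several documents, whereas the Text Import approach yields one document per file.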