*** work in progress ***
In this topic I will show some examples of Text Processing with KNIME. The examples are based on the L4 - Introduction to Text Processing course.
In the first workflow, we will read a Table containing restaurant reviews from Tripadvisor and convert these into Documents.
In KNIME, a "document" is a data type used to represent textual information, similar to how tables represent structured data. It's not a physical file or a specific format in the traditional sense, but rather a representation of text that can be processed and manipulated within KNIME workflows.
The "Document Viewer" node is quite useful for inspecting our documents. Below are screenshots of the Document Viewer showing the list of documents and an individual Document. This will become more interesting when we start enriching and transforming our Documents.
In the next workflow, we will enrich our documents using the POS Tagger node. POS stands for Part of Speech. The node tags each term of a document using the Penn Treebank tag set.
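To give an idea of what Penn Treebank tags look like, here is a small Python sketch listing a few common tags with their meanings. It is only an illustration outside KNIME: the `describe` helper and the hand-tagged example phrase are hypothetical, and the tags the POS Tagger node actually assigns to a review may differ.

```python
# A few common Penn Treebank POS tags and their meanings (not the full set).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "JJ": "adjective",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "CC": "coordinating conjunction",
}


def describe(tagged_terms):
    """Return (term, tag, description) triples for already-tagged terms."""
    return [
        (term, tag, PENN_TAGS.get(tag, "unknown tag"))
        for term, tag in tagged_terms
    ]


# Hypothetical hand-tagged phrase, standing in for POS Tagger output.
example = [("The", "DT"), ("food", "NN"), ("was", "VBD"), ("delicious", "JJ")]
for row in describe(example):
    print(row)
```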
Below you see a screenshot of a Tagged Document as shown in the Tagged Document Viewer.
Imagine we want to classify our restaurant reviews into two categories: positive and negative reviews.
For this, we can also use tagging, but in the following workflow we will take a slightly different approach. Instead of the POS Tagger, we will use the Dictionary Tagger node. This node draws from a table of terms (in this case coming from .CSV files). In this example we have two files: one with terms expressing positive sentiments and one with terms expressing negative ones.
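The idea behind dictionary tagging can be sketched in a few lines of Python. This is a simplified illustration, not how the Dictionary Tagger node is implemented: the two term sets are made-up stand-ins for the CSV files, the function name `dictionary_tag` is hypothetical, and the matching here is lowercase and single-word only.

```python
import re

# Hypothetical term lists; in the workflow these come from two CSV files.
POSITIVE_TERMS = {"great", "delicious", "friendly"}
NEGATIVE_TERMS = {"slow", "cold", "rude"}


def dictionary_tag(text):
    """Return (term, tag) pairs for every term found in either dictionary."""
    tagged = []
    # Lowercase the text and split it into simple word tokens.
    for term in re.findall(r"[a-z']+", text.lower()):
        if term in POSITIVE_TERMS:
            tagged.append((term, "POSITIVE"))
        elif term in NEGATIVE_TERMS:
            tagged.append((term, "NEGATIVE"))
    return tagged


review = "Great food and friendly staff, but the service was slow."
print(dictionary_tag(review))
# [('great', 'POSITIVE'), ('friendly', 'POSITIVE'), ('slow', 'NEGATIVE')]
```

Terms that appear in neither list (such as "food" or "staff") simply stay untagged.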
Below you can see an example of a Tagged document, as the Tagged Document Viewer shows it.
Okay, but now I would like to know if a Review is overall Positive or Negative. This is accomplished by the workflow below. First we convert tags to strings. Then, in the Expression node, we convert 'POSITIVE' tags to an integer value +1 and 'NEGATIVE' tags to -1.
Then, in the GroupBy node, we group by document and take the sum of the integer tag values. A document with a sum greater than 0 has more positive than negative terms, and a document with a sum below 0 has more negative than positive terms.
For example, for the "Hole in the wall..." document, the summed value is +2: as we can see in the Tagged Document Viewer, this document has 6 Positive and 4 Negative tags (6 - 4 = 2).
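The scoring logic (tag to integer, then sum per document) can be written out as a short Python sketch. It mirrors what the Expression and GroupBy nodes do in the workflow but is not their actual configuration; the names `TAG_VALUES` and `score_documents` are hypothetical, and the input rows are constructed by hand to match the 6 Positive and 4 Negative tags of the example document.

```python
from collections import defaultdict

# Same mapping as in the Expression node: POSITIVE -> +1, NEGATIVE -> -1.
TAG_VALUES = {"POSITIVE": 1, "NEGATIVE": -1}


def score_documents(rows):
    """Sum the +1/-1 tag values per document, like the GroupBy step.

    rows: iterable of (document_title, tag_string) pairs, one per tagged term.
    """
    totals = defaultdict(int)
    for title, tag in rows:
        # Tags other than POSITIVE/NEGATIVE contribute 0 to the sum.
        totals[title] += TAG_VALUES.get(tag, 0)
    return dict(totals)


# Hand-built rows: 6 POSITIVE and 4 NEGATIVE tags for one document.
title = "Hole in the wall..."
rows = [(title, "POSITIVE")] * 6 + [(title, "NEGATIVE")] * 4
print(score_documents(rows))  # {'Hole in the wall...': 2}
```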
On the next page we will have a brief look at the PDF Parser and Tika Parser nodes.