If we want to use the text in our Documents to train machine learning models, there are 2 additional steps we need to perform:
1) clean the text (nodes in the red box). Most of these nodes have a 'replace column / append column' configuration option. If you use 'append column' you can easily compare the 'before' and 'after' versions of the text by looking at the node output.
The Number Filter node will remove numbers from the text.
The Punctuation Erasure node will eliminate punctuation from the text.
The Stop Word Filter node will remove stop words from the text. These are words that add little or no meaning, like 'the', 'we', 'this'. You can either use a built-in list of stop words (and choose from different languages) or provide your own custom list.
The Case Converter node will simply convert all your text to upper or lower case.
The Snowball Stemmer node stems words, i.e. reduces them to a common root form (for example, 'cooking' and 'cooked' both become 'cook'). Stemming is explained in detail here.
Finally, the Tag Filter node allows you to filter (remove) terms with selected tags from the text; a small code sketch of these cleaning steps follows this list.
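These nodes operate on KNIME Document cells, but the ideas behind them are easy to show in plain Python. The sketch below is not KNIME code: it approximates each cleaning node with a standard-library operation, uses a tiny hand-written stop word list in place of the node's built-in lists, and uses NLTK's SnowballStemmer for the stemming step.

```python
import re
import string

from nltk.stem.snowball import SnowballStemmer  # pip install nltk

# Tiny stand-in for the node's built-in stop word lists.
STOP_WORDS = {"the", "we", "this", "a", "an", "and", "was", "of", "to", "in"}

stemmer = SnowballStemmer("english")

def clean(text: str) -> list[str]:
    text = re.sub(r"\d+", "", text)                                   # Number Filter
    text = text.translate(str.maketrans("", "", string.punctuation))  # Punctuation Erasure
    tokens = text.lower().split()                                     # Case Converter (+ naive tokenization)
    tokens = [t for t in tokens if t not in STOP_WORDS]               # Stop Word Filter
    return [stemmer.stem(t) for t in tokens]                          # Snowball Stemmer

print(clean("We visited this restaurant 3 times, and the cooking was excellent!"))
# roughly: ['visit', 'restaur', 'time', 'cook', 'excel']
```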
2) create a numeric representation of the text (nodes in the blue box).
The Bag of Words Creator node creates, well, a bag of words. This is a table containing each term encountered in each of the documents. In the following 4 nodes we will reduce the size of this table by keeping only the Terms that occur in at least 5 Documents.
The next node is TF (Term Frequency). It computes the relative term frequency (tf) of each term in each document and adds a column containing the tf value. The value is computed by dividing the absolute frequency of a term in a document by the total number of terms in that document.
The Document Vector node turns the bag of words into a table of Document vectors: there will be one row for each Document and one column for each Term in the Bag of Words. The value of each 'cell' will be 1 if the Term is contained in the Document and 0 if it is not.
Finally, the Document Data Extractor node will extract the (restaurant) Category metadata from each Document and add it as a column to the output table, as illustrated in the sketch below.
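Again, this is only an illustration of the ideas, not what the KNIME nodes do internally. The sketch below uses a made-up three-document corpus with invented Category labels, and a minimum document frequency of 2 instead of 5 because the toy corpus is so small.

```python
from collections import Counter

# Toy corpus: each Document is a cleaned token list from step 1,
# paired with a made-up restaurant Category label.
docs = {
    "doc1": (["pizza", "crust", "thin", "pizza"], "Italian"),
    "doc2": (["noodl", "soup", "spici"], "Asian"),
    "doc3": (["pizza", "soup", "cold"], "Italian"),
}

# Bag of Words: term counts per Document.
bow = {d: Counter(tokens) for d, (tokens, _) in docs.items()}

# Keep only Terms occurring in at least MIN_DF Documents
# (the workflow uses 5; 2 here because the toy corpus is tiny).
MIN_DF = 2
doc_freq = Counter(t for counts in bow.values() for t in counts)
vocab = sorted(t for t, n in doc_freq.items() if n >= MIN_DF)

# Relative term frequency: tf(t, d) = count of t in d / total number of terms in d.
tf = {d: {t: n / sum(counts.values()) for t, n in counts.items()}
      for d, counts in bow.items()}
print(tf["doc1"]["pizza"])   # 0.5 -> 'pizza' occurs 2 times out of 4 terms

# Document Vector: one row per Document, one 0/1 column per Term,
# plus the Category column extracted from the Document metadata.
for d, (_, category) in docs.items():
    vector = [1 if t in bow[d] else 0 for t in vocab]
    print(d, vector, category)
# vocab = ['pizza', 'soup'], so:
# doc1 [1, 0] Italian
# doc2 [0, 1] Asian
# doc3 [1, 1] Italian
```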
On the next page we will see how to classify our documents.