Here I collect some resources and tips that have helped me recently and that are useful for various word-processing tools and, in general, for working with text as data.
Linguistically, texts are composed of phrases built from verbs, nouns, adjectives, adverbs, proper nouns, etc., but also from articles, connectors, pronouns, common adverbs, and prepositions. The latter are called stop words (or empty words), because they don't provide any valuable meaning about what the text is trying to say, which matters when working with Natural Language Processing (NLP) tools. Therefore, in most projects they are removed. For example, suppose we are working with the book "The Little Prince" and the phrase:
"The essential is invisible to the eyes"
Stop word removal works like this:
"essential invisible eyes"
In Python you can use the packages:
In R you can use the packages:
All of these packages let you add your own list of stop words, and some of them support languages other than English. You can add, for example, numbers, punctuation symbols, etc.
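As a minimal sketch, here is how this can look in Python with NLTK (one common choice among such packages; the extra entries added to the list are just examples):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch NLTK's stop word lists once

phrase = "The essential is invisible to the eyes"

# Start from NLTK's built-in English list and extend it with our own
# entries, e.g. stray numbers or symbols we also want to drop.
stop_words = set(stopwords.words("english"))
stop_words.update({"1", "2", "&"})

cleaned = [word for word in phrase.split() if word.lower() not in stop_words]
print(" ".join(cleaned))  # -> "essential invisible eyes"
```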
Sometimes when we work with text, our data source may have typos or generally noisy text that affect any further processing. In Python there is a package called symspellpy that allows you to make text corrections. It works with a dictionary of the language we are working in (in the case of Spanish I had used this one, but there are many more available), and we must tell it the maximum edit distance we will allow. For example, a distance of 5 means that when the algorithm finds an error, it is allowed to make up to 5 edits such as splitting, deleting, joining, or replacing characters. For example:
Te esssential isinvizible to teh eyes.
The essential is invisible to the eyes.
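As a rough sketch, this is how that correction looks with symspellpy, following the package's documented setup and using the English frequency dictionary bundled with it (for Spanish you would load a Spanish dictionary instead):

```python
import pkg_resources
from symspellpy import SymSpell

# Maximum edit distance the algorithm may use per word.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# The English frequency dictionary shipped with symspellpy.
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup_compound also handles joined/split words like "isinvizible".
suggestions = sym_spell.lookup_compound(
    "Te esssential isinvizible to teh eyes", max_edit_distance=2
)
# Expected: something like "the essential is invisible to the eyes"
print(suggestions[0].term)
```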
In Python you can use the wordcloud package, and in R the wordcloud package, but also the ggwordcloud package.
Fig: Words most used by female politicians in their campaign slogans in Colombia
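A minimal sketch with the Python wordcloud package; the input text here is just illustrative, not the politicians' data:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Any long string works as input; word frequencies drive the font sizes.
text = "essential invisible eyes essential heart essential eyes"

wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```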
The main idea of word vectorization is to convert words into numbers and plot them in a vector space, which allows algebraic manipulation of words. There are different methods for this, depending on what we want to do and the inputs we have. The following graphs show, as an example, the vectors closest to the word "govern" in the government programs of politicians in Colombia, differentiated by gender.
One basic approach represents the text as a matrix where the rows are documents and the columns are the unique words in the corpus; the values indicate the frequency of each word in a document. Another approach, Word2Vec, learns dense word vectors through two common techniques: CBOW and Skip-gram. CBOW: the model is fed the context and predicts the target word. Skip-gram: the model is fed the target word and predicts the context words. In this paper there is a detailed explanation.
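As a minimal sketch of the document-term matrix, using scikit-learn's CountVectorizer with two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the essential is invisible to the eyes",
    "the eyes see the essential",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)  # rows = documents, columns = unique words

print(vectorizer.get_feature_names_out())  # the corpus vocabulary
print(matrix.toarray())                    # word frequencies per document
```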
In Python I have used Word2Vec from gensim, and in R the text2vec package.
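A minimal gensim sketch; the toy corpus and the parameters are illustrative, not the ones used for the Colombian data:

```python
from gensim.models import Word2Vec

# Each document is a list of (already cleaned) tokens.
corpus = [
    ["essential", "invisible", "eyes"],
    ["govern", "people", "country"],
    ["govern", "country", "education"],
]

# sg=0 trains CBOW (context -> target word); sg=1 trains Skip-gram.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)

# Words closest to "govern" in the learned vector space.
print(model.wv.most_similar("govern", topn=3))
```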
⚠️Under construction