NLP Part-1
What is Natural Language Processing?
Natural Language Processing (NLP) is the ability of a computer to understand human natural language (e.g., English, Malay, Chinese) in written text or speech.
It combines AI and Machine Learning techniques to extract and discover knowledge from textual data.
Applications of NLP
NLP is applied in many areas, such as filtering spam emails, extracting information about an author, translating text between languages, and producing automatic summaries.
Recently, sentiment analysis of social media reviews and tweets, as well as chatbot applications, have become favourite topics among researchers.
NLP Pipeline
The basic steps to develop an NLP application or model are text pre-processing, text parsing, text representation, model development using AI or Machine Learning techniques, and model evaluation.
The number of processes can differ based on the nature of the text data and the NLP model to be developed.
Text Preprocessing
There are many Python packages and libraries available for text preprocessing, such as NLTK, scikit-learn, TextBlob, spaCy and VADER.
Some common pre-processing steps that can be applied are:
Data Cleansing
Remove punctuation [, ! $ ( ) * % @]
Remove URLs, names, ids, emojis, HTML codes etc.
Convert from upper case to lower case
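The cleansing steps listed above can be sketched with Python's re and string modules. The regex patterns below are simple illustrations, not exhaustive cleaners, and the function name basic_clean is our own:

```python
import re
import string

def basic_clean(text):
    """Illustrative cleaning: URLs, HTML tags, punctuation, lowercasing.
    The patterns are simple examples, not a complete cleaner."""
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"<[^>]+>", "", text)        # remove HTML tags
    # remove punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower().strip()                # lower case, trim edges

print(basic_clean("Visit <b>https://example.com</b> NOW!"))
```

Note that the order matters: stripping punctuation before removing URLs would break the URL pattern.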
Tokenization
A process of splitting a sentence into individual words (tokens).
For example, "Sue loves singing" becomes ['sue', 'loves', 'singing']
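The simplest tokenizer is plain whitespace splitting, shown below; library tokenizers such as NLTK's word_tokenize additionally handle punctuation and contractions:

```python
sentence = "sue loves singing"
tokens = sentence.split()  # split on whitespace
print(tokens)  # ['sue', 'loves', 'singing']
```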
Remove stop words
Stop words are common, high-frequency words that carry little meaning for text processing and can therefore be eliminated.
For example “a”, “the”, “is”, “are”.
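Stop-word removal is a simple membership filter over the token list. The tiny stop-word set below is illustrative; NLTK ships a much fuller list:

```python
stop_words = {"a", "the", "is", "are"}  # small illustrative set
tokens = ["the", "food", "is", "great"]

# keep only tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['food', 'great']
```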
Stemming
Stripping the suffix from a word to reduce it to its root form (stem).
For example, singing becomes sing.
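The idea of suffix stripping can be shown with a toy stemmer; real stemmers such as NLTK's PorterStemmer apply a much larger rule set with conditions on the stem:

```python
def toy_stem(word):
    """Toy stemmer for illustration only: strip one common suffix
    if enough of the word remains to be a plausible stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("singing"))  # sing
print(toy_stem("loves"))    # love
```

Even real stemmers produce non-words (e.g. Porter stems "studies" to "studi"), which is why lemmatization is sometimes preferred.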
Lemmatization
A process of mapping a group of inflected words to their common dictionary root (lemma).
For example the word sings, sung, and sang are from the verb sing.
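Conceptually, lemmatization is a dictionary lookup. The tiny table below is a toy illustration; a real lemmatizer such as NLTK's WordNetLemmatizer consults the WordNet dictionary and benefits from part-of-speech tags:

```python
# toy lookup table for illustration only
lemma_table = {"sings": "sing", "sung": "sing", "sang": "sing"}

def toy_lemmatize(word):
    # fall back to the word itself when it is not in the table
    return lemma_table.get(word, word)

print([toy_lemmatize(w) for w in ["sings", "sung", "sang"]])  # ['sing', 'sing', 'sing']
```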
Let's Practise
We will start with the Data Cleansing process, using sample data from a restaurant review dataset named sample1.csv, used for sentiment analysis with two labels (positive and negative). The code below calls the remove_punctuation function and stores the result in a new column named remove_punct. The data in remove_punct is then converted to lower case and stored in another new column named to_lower. The process ends by saving the cleansed text data into a new CSV file named sample1_clean.csv.
Sample restaurant review data before Data Cleansing
# DATA CLEANSING
# import libraries
import string
import pandas as pd

# reading the data
data = pd.read_csv("data/sample1.csv", encoding="ISO-8859-1")

# display the review text and sentiment columns in full
pd.set_option('display.max_colwidth', None)
data = data[['review', 'sentiment']]
data.head()

# function to remove punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# store the punctuation-free review text in a new column
data['remove_punct'] = data['review'].apply(lambda x: remove_punctuation(x))

# convert to lower case and store in a new column
data['to_lower'] = data['remove_punct'].apply(lambda x: x.lower())

# save the cleansed data
data.to_csv("data/sample1_clean.csv", index=False)
Next, in Text Preprocessing, we read sample1_clean.csv and apply the tokenization, stop-word removal, stemming and lemmatization tasks. We import the NLTK library for stemming and lemmatization. For each task we create a new data column, such as review_token after tokenization, so that we can check and identify the changes to the text data after each preprocessing step. We save the result as a new CSV file named sample1_processed.csv once the work is finished.
# TEXT PREPROCESSING
# TOKENIZATION, STOP WORDS, STEMMING AND LEMMATIZATION
import nltk
import pandas as pd
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# the first run may require downloading NLTK resources:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

# reading the data
data = pd.read_csv("data/sample1_clean.csv", encoding="ISO-8859-1")

# 1) function for tokenization
def tokenization(text):
    tokens = nltk.word_tokenize(text)
    print(tokens)
    print("Number of Words:", len(tokens))
    return tokens

# 2) function for stop-word removal
def remove_stopwords(text):
    output = [i for i in text if i not in stopwords]
    return output

# 3) function for stemming
def stemming(text):
    stem_text = [porter_stemmer.stem(word) for word in text]
    return stem_text

# 4) function for lemmatization
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text

# use the existing stop words provided by the library
stopwords = nltk.corpus.stopwords.words('english')
# if new stop words are needed, they can be added manually
new_stopwords = ['rolls', 'spicy']
stopwords.extend(new_stopwords)

# applying the tokenization function
data['review_token'] = data['to_lower'].apply(lambda x: tokenization(x))
# applying the stop-word removal function
data['no_stopwords'] = data['review_token'].apply(lambda x: remove_stopwords(x))
# defining the object for stemming
porter_stemmer = PorterStemmer()
# applying the stemming function
data['review_stemmed'] = data['no_stopwords'].apply(lambda x: stemming(x))
# defining the object for lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
# applying the lemmatization function
data['review_lemmatized'] = data['no_stopwords'].apply(lambda x: lemmatizer(x))
data.head()
data.to_csv("data/sample1_processed.csv", index=False)
Restaurant review data after Data Cleansing and Text Preprocessing
*This practice only covers the basics of preprocessing; you can apply other preprocessing tasks suited to your text data, such as spell checking or HTML code removal.
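One such extra step, HTML code removal, can be sketched with a regular expression. This is a naive pattern for illustration, not a full HTML parser, and the function name strip_html is our own:

```python
import re

def strip_html(text):
    """Replace simple HTML tags with spaces; a naive regex sketch,
    not a substitute for a real HTML parser."""
    return re.sub(r"<[^>]+>", " ", text).strip()

print(strip_html("<p>Great <b>tacos</b>!</p>"))
```

For messy real-world HTML, a dedicated parser such as Beautiful Soup is more robust than a regex.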