NLP Part-1

What is Natural Language Processing


  • Natural Language Processing (NLP) is the ability of a computer to understand human natural language (e.g., English, Malay, Chinese) in written text or speech.

  • It combines AI and Machine Learning techniques to extract and discover knowledge from textual data.

Applications of NLP

  • NLP is used in many applications, such as filtering spam emails, extracting information about an author, translating text between languages, and producing automatic summaries.

  • Recently, sentiment analysis of social media reviews and tweets, as well as chatbot applications, has been a favourite among researchers.

NLP Pipeline

  • The basic steps to develop an NLP application or model are text pre-processing, text parsing, text representation, model development using AI or Machine Learning, and model evaluation.

  • The number of processes can differ based on the nature of the text data and the NLP model to be developed.
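
To make the pipeline concrete, here is a minimal sketch of the first stages as plain Python functions. The function names and the simple bag-of-words representation are illustrative choices for this sketch, not a fixed API.

# a minimal, illustrative sketch of the early NLP pipeline stages
from collections import Counter

def preprocess(raw_text):
    #cleanse and tokenize; stop words, stemming and lemmatization
    #are covered in detail in the next sections
    return raw_text.lower().split()

def represent(tokens):
    #the simplest text representation: word counts (bag of words)
    return Counter(tokens)

features = represent(preprocess("Sue loves singing"))
print(features)   #Counter({'sue': 1, 'loves': 1, 'singing': 1})

Model development and evaluation would then follow, for example with a Machine Learning library such as scikit-learn.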

Text Preprocessing

  • There are many Python packages and libraries available for text preprocessing, such as NLTK, scikit-learn, TextBlob, spaCy and VADER.

  • Some common pre-processing steps that can be applied are:

    1. Data Cleansing

      • Remove punctuation [, ! $ ( ) * % @]

      • Remove URLs, names, ids, emojis, HTML codes etc.

      • Convert from upper case to lower case

    2. Tokenization

      • The process of splitting a sentence into individual words (tokens).

      • For example "Sue loves singing" to ['sue', 'loves', singing']

    3. Remove stop words

      • Stop words are common, high-frequency words that contribute little to text processing and can therefore be removed.

      • For example “a”, “the”, “is”, “are”.

    4. Stemming

      • Stripping the suffix from a word so that it is reduced to its root form (stem).

      • For example, singing becomes sing.

    5. Lemmatization

      • The process of mapping a group of words to their common dictionary root (lemma); see the short demo after this list.

      • For example, the words sings, sung, and sang all map to the verb sing.
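
The difference between stemming and lemmatization is easiest to see side by side. The short demo below uses NLTK and assumes the WordNet data has been downloaded; note that the lemmatizer needs to be told the part of speech (pos="v" for verbs) to resolve irregular forms such as sung.

# comparing stemming and lemmatization with NLTK
# run nltk.download('wordnet') once if the WordNet data is missing
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
wnl = WordNetLemmatizer()

for word in ["sings", "sung", "sang", "singing"]:
    #the stemmer only strips suffixes, while the lemmatizer maps
    #irregular forms to the dictionary root when given the right pos
    print(word, "->", stemmer.stem(word), "|", wnl.lemmatize(word, pos="v"))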


Let's Practise

We will start with the Data Cleansing process using sample data from a restaurant review dataset named sample1.csv, used for sentiment mining with 2 labels (positive and negative). The code below calls the remove_punctuation function and stores the result in a new data column named remove_punct. The data in the remove_punct column is then converted to lower case and stored in a new column named to_lower. The process ends by saving the cleansed text data into a new csv file named sample1_clean.csv.

Sample restaurant review data before Data Cleansing

# DATA CLEANSING

# import libraries
import string, re
import pandas as pd

#reading the data
data = pd.read_csv("data/sample1.csv", encoding="ISO-8859-1")

#display the review text and sentiment columns
pd.set_option('display.max_colwidth', None)
data = data[['review','sentiment']]
data.head()

#function to remove punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

#store the punctuation-free review text in a new column
data['remove_punct'] = data['review'].apply(lambda x: remove_punctuation(x))

#convert the review text to lower case and store it in a new column
data['to_lower'] = data['remove_punct'].apply(lambda x: x.lower())

data.to_csv("data/sample1_clean.csv")

Next, in Text Preprocessing, we read sample1_clean.csv and apply the tokenization, stop word removal, stemming and lemmatization tasks. We import the NLTK library for the stemming and lemmatization. For each task we create a new data column, such as review_token after the tokenization step, so we can check and identify the changes to the text data after each preprocessing step. We save the result as sample1_processed.csv once the work is finished.

# TEXT PREPROCESSING
# TOKENIZATION, STOP WORDS, STEMMING AND LEMMATIZATION

import nltk
import pandas as pd
import string, re
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

#download the required NLTK resources once, if not already installed
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')

#reading the data
data = pd.read_csv("data/sample1_clean.csv", encoding="ISO-8859-1")

#1) function for tokenization
def tokenization(text):
    tokens = nltk.word_tokenize(text)
    print(tokens)
    print("Number of Words: ", len(tokens))
    return tokens

#2) function for stop words removal
def remove_stopwords(text):
    output = [i for i in text if i not in stopwords]
    return output

#3) function for stemming
def stemming(text):
    stem_text = [porter_stemmer.stem(word) for word in text]
    return stem_text

#4) function for lemmatization
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text

#use the existing stop words list from the library
stopwords = nltk.corpus.stopwords.words('english')
#if new stop words are needed, they can be added manually
new_stopwords = ['rolls', 'spicy']
stopwords.extend(new_stopwords)

#applying the tokenization function
data['review_token'] = data['to_lower'].apply(lambda x: tokenization(x))

#applying the stopwords function
data['no_stopwords'] = data['review_token'].apply(lambda x: remove_stopwords(x))

#defining the object for stemming
porter_stemmer = PorterStemmer()
#applying the stemming function
data['review_stemmed'] = data['no_stopwords'].apply(lambda x: stemming(x))

#defining the object for lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
#applying the lemmatization function
data['review_lemmatized'] = data['no_stopwords'].apply(lambda x: lemmatizer(x))

data.head()
data.to_csv("data/sample1_processed.csv")

Restaurant review data after Data Cleansing and Text Preprocessing

*This practice only covers the basics of preprocessing; you can try applying other preprocessing tasks that suit your text data, such as spell checking or HTML code removal.