Natural Language Processing

Sentiment Analysis Using NLTK

The modules used for this basic sentiment analysis project are listed below, with a consolidated import sketch after the list:

  • Pandas - data munging, reading/writing data frames
  • NLTK - Natural Language Toolkit, an open-source Python library for natural language processing
  • re - regular expressions for text processing
  • Matplotlib - plotting and visualization
  • string - dealing with string data
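A minimal set of imports covering these modules (Pandas and string are listed for completeness; the snippets below mainly use nltk, re, and Matplotlib):

import pandas as pd              # data munging, reading/writing data frames
import nltk                      # Natural Language Toolkit
import re                        # regular expressions for text cleaning
import matplotlib.pyplot as plt  # plotting and visualization
import string                    # string constants such as punctuation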

Basic Steps in NLP Project:

    • Data Collection
    • Data Preparation
    • Model Selection
    • Model Building
    • Model Training
    • Evaluation
    • Fine Tuning and Testing

Data Collection

We use the pre-loaded twitter_samples corpus that ships with NLTK:

import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')  # fetch the corpus on first run

# Load the raw tweet strings for each sentiment class
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
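A quick sanity check of what was loaded; the counts are printed rather than assumed:

print('Number of positive tweets:', len(all_positive_tweets))
print('Number of negative tweets:', len(all_negative_tweets))
print('Example tweet:', all_positive_tweets[0])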

Data Visualization

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 5))
labels = 'Positive', 'Negative'
sizes = [len(all_positive_tweets), len(all_negative_tweets)]

# Pie chart of the class distribution
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()

Data Preparation

Most of the data we receive in the real world is dirty; it needs a lot of pre-processing and cleaning.

Regular expressions are used to pre-process the data:

import re                                  # regular expressions

from nltk.corpus import stopwords          # common words to filter out
from nltk.stem import PorterStemmer        # stemming algorithm
from nltk.tokenize import TweetTokenizer   # tweet-aware tokenizer

tweet = all_positive_tweets[0]             # pick one tweet to clean; any index works

tweet2 = re.sub(r'^RT[\s]+', '', tweet)               # remove the retweet marker "RT"
tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # remove hyperlinks
tweet2 = re.sub(r'#', '', tweet2)                     # remove the hash sign, keep the word

Tokenization

# Lowercase the text, strip @handles, and shorten long runs of repeated characters
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
tweet_tokens = tokenizer.tokenize(tweet2)  # split the cleaned tweet into tokens
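The stopwords and PorterStemmer imports above are typically used to finish cleaning the token list. A minimal sketch of that step, assuming the tweet_tokens list from above (variable names here are illustrative, not from the original):

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')                 # needed once before using the stopword list

stopwords_english = stopwords.words('english')
stemmer = PorterStemmer()

tweets_clean = []
for word in tweet_tokens:
    # Keep only tokens that are neither stopwords nor punctuation
    if word not in stopwords_english and word not in string.punctuation:
        tweets_clean.append(stemmer.stem(word))  # reduce each kept word to its stem

print(tweets_clean)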