Natural Language Processing
Sentiment Analysis Using NLTK
The modules used for this basic sentiment analysis project are:
- pandas - data munging, reading/writing data frames
- NLTK - Natural Language Toolkit, an open-source Python library for natural language processing
- re - regular expressions for text processing
- Matplotlib - plotting and visualization
- string - utilities for working with string data
Basic Steps in NLP Project:
- Data Collection
- Data Preparation
- Model Selection
- Model Building
- Model Training
- Evaluation
- Fine Tuning and Testing
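As a rough sketch of the later stages (model building, training, and evaluation), NLTK's `NaiveBayesClassifier` can be trained directly on word-feature dictionaries. The tiny labelled examples below are purely illustrative and stand in for real, pre-processed tweets:

```python
import nltk

# Toy labelled feature sets standing in for real, pre-processed tweets
# (words and labels here are illustrative, not from the actual corpus).
train_set = [
    ({'great': True, 'love': True}, 'pos'),
    ({'awesome': True, 'happy': True}, 'pos'),
    ({'terrible': True, 'sad': True}, 'neg'),
    ({'awful': True, 'hate': True}, 'neg'),
]

# Model building and training happen in one call
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluation on a (toy) held-out example
print(classifier.classify({'love': True, 'happy': True}))  # → 'pos'
```

In the real project, the feature dictionaries would come out of the data preparation steps described below.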
Data Selection
Using the Pre-Loaded data from NLTK
import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')  # fetch the corpus once if not already present

# 5,000 positive and 5,000 negative example tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
Data Visualization
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 5))
labels = 'Positives', 'Negatives'
sizes = [len(all_positive_tweets), len(all_negative_tweets)]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()
Data Preparation
Most of the data we receive in the real world is dirty; it needs a lot of pre-processing and cleaning.
Regular expressions are used to pre-process the data:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

tweet2 = re.sub(r'^RT[\s]+', '', tweet)               # remove the old-style retweet marker "RT"
tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # remove hyperlinks
tweet2 = re.sub(r'#', '', tweet2)                     # remove the '#' sign, keeping the word
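Put together, the three substitutions above can be wrapped in a small helper. The function name `clean_tweet` and the sample tweet are illustrative, not part of the original project:

```python
import re

def clean_tweet(tweet):
    """Apply the regex clean-up steps: retweet marker, URLs, '#' signs."""
    tweet2 = re.sub(r'^RT[\s]+', '', tweet)               # old-style retweet marker
    tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # hyperlinks (to end of line)
    tweet2 = re.sub(r'#', '', tweet2)                     # '#' sign only, word is kept
    return tweet2

print(clean_tweet('RT My beautiful sunflowers #sunflowers https://t.co/abc'))
# → 'My beautiful sunflowers sunflowers '
```

Note that the URL pattern removes everything from `http`/`https` to the end of the line, so anything after a link is lost as well.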
Tokenization
tokenizer = TweetTokenizer(preserve_case=False,  # lowercase everything
                           strip_handles=True,   # drop @usernames
                           reduce_len=True)      # cap repeated characters at three (e.g. "waaaaay" -> "waaay")
tweet_tokens = tokenizer.tokenize(tweet2)
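The `stopwords` and `PorterStemmer` imports above are used in the next clean-up pass: dropping stopwords and punctuation, then stemming what remains. The sketch below uses a tiny hardcoded stopword set so it runs without downloads; in the real project you would call `nltk.download('stopwords')` and use `stopwords.words('english')` instead, and the sample tweet is illustrative:

```python
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Tiny stand-in for stopwords.words('english'), to keep the sketch self-contained
stopwords_english = {'is', 'a', 'the', 'this', 'too', 'much', 'for', 'you'}

stemmer = PorterStemmer()
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tweet_tokens = tokenizer.tokenize('@remy This is waaaaayyyy too much for you!!!!!!')

tweets_clean = [
    stemmer.stem(word)                   # reduce each word to its stem
    for word in tweet_tokens
    if word not in stopwords_english     # drop stopwords
    and word not in string.punctuation   # drop punctuation tokens
]
print(tweets_clean)
```

After this step each tweet is a short list of lowercase stems, ready to be turned into model features.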