Natural Language Processing
Sentiment Analysis Using NLTK
The modules used for this basic sentiment analysis project are:
- pandas - data munging, reading/writing data frames
- NLTK - Natural Language Toolkit, an open-source Python library for natural language processing
- re - regular expressions for text processing
- Matplotlib - plotting and visualization
- string - utilities for working with string data
Basic Steps in NLP Project:
- Data Collection
- Data Preparation
- Model Selection
- Model Building
- Model Training
- Evaluation
- Fine Tuning and Testing
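As a rough sketch of the later stages (model building, training, and evaluation), NLTK's `NaiveBayesClassifier` can be trained directly on word-feature dictionaries. The tiny labelled examples below are purely illustrative and stand in for real, pre-processed tweets:

```python
import nltk

# Toy labelled feature sets standing in for real, pre-processed tweets
# (words and labels here are illustrative, not from the actual corpus).
train_set = [
    ({'great': True, 'love': True}, 'pos'),
    ({'awesome': True, 'happy': True}, 'pos'),
    ({'terrible': True, 'sad': True}, 'neg'),
    ({'awful': True, 'hate': True}, 'neg'),
]

# Model building and training happen in one call
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluation on a (toy) held-out example
print(classifier.classify({'love': True, 'happy': True}))  # → 'pos'
```

In the real project, the feature dictionaries would come out of the data preparation steps described below.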
Data Selection
Using the Pre-Loaded data from NLTK
import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')  # fetch the corpus once if not already present

# 5,000 positive and 5,000 negative example tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
Data Visualization
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 5))
labels = 'Positives', 'Negatives'
sizes = [len(all_positive_tweets), len(all_negative_tweets)]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()
Data Preparation
Most of the data we receive in the real world is dirty; it needs a lot of pre-processing and cleaning.
Regular expressions are used to pre-process the data:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

tweet2 = re.sub(r'^RT[\s]+', '', tweet)               # remove the old-style retweet marker "RT"
tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # remove hyperlinks
tweet2 = re.sub(r'#', '', tweet2)                     # remove the '#' sign, keeping the word
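Put together, the three substitutions above can be wrapped in a small helper. The function name `clean_tweet` and the sample tweet are illustrative, not part of the original project:

```python
import re

def clean_tweet(tweet):
    """Apply the regex clean-up steps: retweet marker, URLs, '#' signs."""
    tweet2 = re.sub(r'^RT[\s]+', '', tweet)               # old-style retweet marker
    tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # hyperlinks (to end of line)
    tweet2 = re.sub(r'#', '', tweet2)                     # '#' sign only, word is kept
    return tweet2

print(clean_tweet('RT My beautiful sunflowers #sunflowers https://t.co/abc'))
# → 'My beautiful sunflowers sunflowers '
```

Note that the URL pattern removes everything from `http`/`https` to the end of the line, so anything after a link is lost as well.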
Tokenization
tokenizer = TweetTokenizer(preserve_case=False,  # lowercase everything
                           strip_handles=True,   # drop @usernames
                           reduce_len=True)      # cap repeated characters at three (e.g. "waaaaay" -> "waaay")
tweet_tokens = tokenizer.tokenize(tweet2)
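The `stopwords` and `PorterStemmer` imports above are used in the next clean-up pass: dropping stopwords and punctuation, then stemming what remains. The sketch below uses a tiny hardcoded stopword set so it runs without downloads; in the real project you would call `nltk.download('stopwords')` and use `stopwords.words('english')` instead, and the sample tweet is illustrative:

```python
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Tiny stand-in for stopwords.words('english'), to keep the sketch self-contained
stopwords_english = {'is', 'a', 'the', 'this', 'too', 'much', 'for', 'you'}

stemmer = PorterStemmer()
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tweet_tokens = tokenizer.tokenize('@remy This is waaaaayyyy too much for you!!!!!!')

tweets_clean = [
    stemmer.stem(word)                   # reduce each word to its stem
    for word in tweet_tokens
    if word not in stopwords_english     # drop stopwords
    and word not in string.punctuation   # drop punctuation tokens
]
print(tweets_clean)
```

After this step each tweet is a short list of lowercase stems, ready to be turned into model features.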