After collecting the tweet IDs during the data collection process, we hydrated them using twarc (a Python library). For this, we created a Twitter developer account, which gave us the necessary API keys and let us use twarc to hydrate the tweet IDs into the original tweet objects.
We then extracted the relevant attributes from each hydrated tweet object.
The attributes extracted were: created_at, id_str, full_text, user_location, user_verified, place, retweet_count, retweeted, lang, is_quote_status, and, from the retweeted_status object, its created_at, full_text, hashtags, and user_mentions.
Original Tweet Object
Extracted Tweet Object
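A minimal sketch of the hydration and extraction step, assuming the twarc v1 API and a plain-text file of tweet IDs (the credentials and file name are placeholders):

```python
from twarc import Twarc

# Keys from the Twitter developer account (placeholders)
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

extracted = []
for tweet in t.hydrate(open("tweet_ids.txt")):  # hypothetical file: one tweet ID per line
    user = tweet.get("user", {})
    rt = tweet.get("retweeted_status") or {}
    extracted.append({
        "created_at": tweet.get("created_at"),
        "id_str": tweet.get("id_str"),
        "full_text": tweet.get("full_text"),
        "rt_created_at": rt.get("created_at"),
        "rt_full_text": rt.get("full_text"),
        "rt_hashtags": rt.get("entities", {}).get("hashtags"),
        "rt_user_mentions": rt.get("entities", {}).get("user_mentions"),
        "user_location": user.get("location"),
        "user_verified": user.get("verified"),
        "place": tweet.get("place"),
        "retweet_count": tweet.get("retweet_count"),
        "retweeted": tweet.get("retweeted"),
        "lang": tweet.get("lang"),
        "is_quote_status": tweet.get("is_quote_status"),
    })
```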
Aim: to extract countries from the user_location attribute stored in the tweet objects, so that sentiments can be collected country-wise for all 9 days.
Use: to observe how sentiments change with respect to the number of cases in each country.
Data format: the user_location values are noisy and inconsistent.
First Method: using the geopy geocoding library to extract the country name.
We used two geocoders from geopy in Python: Nominatim and OpenMapQuest. Given any address fragment (place, town, county, city, state), they can return the country name; a sketch follows the problem list below.
Problems:
Takes ~24 seconds per 1000 tweets, i.e., about 23 hours for 33 lakh tweets (one file), and hence more than 10 days for all files. NOT FEASIBLE.
Gives a lot of false positives: it would return a country for texts like "good morning", "good night", etc.
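A minimal sketch of this first approach with geopy's Nominatim geocoder (the user_agent string is a placeholder; OpenMapQuest works similarly but requires an API key). Each call is a network request, which is why it is so slow at scale:

```python
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="covid19-tweet-geocoder")  # placeholder user agent

def country_via_geocoder(user_location):
    """Resolve a free-text location to a country name, or None."""
    if not user_location:
        return None
    match = geolocator.geocode(user_location, language="en", addressdetails=True)
    if match is None:
        return None
    return match.raw.get("address", {}).get("country")

print(country_via_geocoder("Mumbai, Maharashtra"))  # -> "India"
```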
Second Method: using preprocessing techniques and the iso3166 library to detect countries.
We first removed the rows where user_location was null.
Parsed the user_location attribute to extract its last word.
Used the iso3166 library to check whether that last word is a country.
Used parseapi (https://parseapi.back4app.com/classes/Country) to fetch the states and cities of the major countries: USA, Canada, China, India, Australia, Russia, Brazil, Iran, South Africa, Spain, Germany, France, and the UK.
With those cities and states in hand, we check whether any of them appears in the user_location attribute; if yes, we store the corresponding country, otherwise we ignore the row (a sketch of these checks follows below).
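A sketch of the country check, assuming the states/cities fetched from parseapi have already been loaded into a dict (the dict shown is a tiny illustrative sample, not the real fetched data):

```python
from iso3166 import countries

# Tiny illustrative sample of the states/cities fetched from parseapi
CITIES_STATES = {
    "India": {"mumbai", "delhi", "maharashtra"},
    "USA": {"new york", "california", "texas"},
}

def detect_country(user_location):
    """Map a noisy user_location string to a country, or None."""
    if not user_location or not user_location.strip():
        return None
    last_word = user_location.strip().rstrip(".,").split()[-1]
    try:
        countries.get(last_word)  # raises KeyError if not a country name/code
        return last_word          # stored as-is, so alpha-2/3 codes can slip in
    except KeyError:
        pass
    lowered = user_location.lower()
    for country, places in CITIES_STATES.items():
        if any(place in lowered for place in places):
            return country
    return None
```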
Used multiprocessing to run the above preprocessing on the large dataframes in parallel, speeding up the computation for all 9 files (corresponding to the 9 dates); see the sketch below.
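A sketch of the multiprocessing fan-out over the nine files, reusing detect_country from the sketch above (the file names are hypothetical):

```python
from multiprocessing import Pool
import pandas as pd

def preprocess(path):
    """Load one day's tweets, drop null locations, and attach a country column."""
    df = pd.read_csv(path)
    df = df.dropna(subset=["user_location"])
    df["country"] = df["user_location"].map(detect_country)  # from the sketch above
    return df.dropna(subset=["country"])

if __name__ == "__main__":
    paths = [f"tweets_day{i}.csv" for i in range(1, 10)]  # hypothetical file names
    with Pool(processes=9) as pool:
        frames = pool.map(preprocess, paths)
```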
RESULT:
Approx 40% of tweets had null or unusable locations; hence, after preprocessing almost every file reduced to about 15 lakh tweets.
This computation took just 15 minutes for all 9 files (~315 lakh tweets).
Unlike the first method, this one produced no false positives.
Problem: in the country column, some values were not country names but alpha-2 or alpha-3 country codes.
This happened because the iso3166 library also recognizes alpha-2/alpha-3 codes as valid countries, so those codes were stored directly instead of names.
Used the pycountry library to convert the alpha-2 and alpha-3 codes in the country column into full country names, applied via pandas apply with a lambda; see the sketch below.
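A sketch of the code-to-name fix with pycountry and a pandas apply (the column name "country" and the sample values are illustrative):

```python
import pandas as pd
import pycountry

def code_to_name(value):
    """Replace alpha-2/alpha-3 codes with full country names; keep names as-is."""
    if isinstance(value, str) and value.isupper() and len(value) in (2, 3):
        match = (pycountry.countries.get(alpha_2=value) if len(value) == 2
                 else pycountry.countries.get(alpha_3=value))
        if match is not None:
            return match.name
    return value

df = pd.DataFrame({"country": ["IN", "USA", "Germany"]})  # illustrative values
df["country"] = df["country"].apply(lambda v: code_to_name(v))
print(df["country"].tolist())  # ['India', 'United States', 'Germany']
```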
Output of one of the files
We analyzed different trends happening around the world during August and September.
We found the trending hashtags and keywords corresponding to each trend.
Used regular expressions in Python to filter the tweet texts by these trends across all 9 files; a sketch follows.
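A sketch of the trend filtering, with a hypothetical trend-to-keyword mapping (the hashtags shown are illustrative, not the actual trending lists):

```python
import re
import pandas as pd

# Hypothetical keyword/hashtag lists per trend
TRENDS = {
    "vaccine":  [r"#?vaccine", r"#?covaxin"],
    "lockdown": [r"#?lockdown", r"#?unlock"],
}

def filter_by_trend(df, trend):
    """Keep tweets whose full_text matches any keyword of the given trend."""
    pattern = re.compile("|".join(TRENDS[trend]), re.IGNORECASE)
    return df[df["full_text"].str.contains(pattern, na=False)]
```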
After the trend analysis, we merged the preprocessed files with the sentiment-analysis files, which contain 5 emotion scores for every tweet, to obtain sentiments by country.
After merging, we grouped the dataframes by country and computed the mean of each emotion value.
Then we used the pycountry library to look up each country's alpha-3 code, which is needed for plotting the visualizations on the world map; a sketch follows.
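A sketch of the merge, per-country aggregation, and alpha-3 lookup (the file names, join key, and the five emotion column names are assumptions):

```python
import pandas as pd
import pycountry

pre = pd.read_csv("preprocessed_day1.csv")    # hypothetical file names
senti = pd.read_csv("sentiments_day1.csv")

EMOTIONS = ["anger", "fear", "happiness", "sadness", "surprise"]  # assumed columns

merged = pre.merge(senti, on="id_str")        # assumed join key
by_country = merged.groupby("country")[EMOTIONS].mean().reset_index()

def alpha_3(name):
    """Look up a country's alpha-3 code for plotting on the world map."""
    try:
        return pycountry.countries.lookup(name).alpha_3
    except LookupError:
        return None

by_country["alpha_3"] = by_country["country"].map(alpha_3)
```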
Output of one of the files