Data Preparation

Data Source

There are two primary sources of data for this project:

  1. Twitter Developer API
    • Used to extract real-time tweets generated by the Twitter accounts under analysis.
  2. Bot repository of the Botometer project
    • "Varol-2017" data set, a pre-classified set of Twitter users that was used as the basis to train and validate the classification models
    • "cresci-2017" data set, a filtered subset of pre-classified Twitter users that was used as the test data for evaluating the classification models

Nature of Data Collected

Raw Tweets

The standard account subscription to the Twitter Developer API imposes certain limitations: tweet data can only be collected over a limited window of a user's history and up to a capped number of tweets. Standard subscription accounts are also unable to access the tweets of users whose accounts are marked as protected or restricted.

So, in order to do a trend analysis of the tweets generated by the users under analysis, up to 3200 of each user's most recent tweets (the maximum limit imposed by Twitter) were extracted using the tweepy library. These tweets were taken as the basis for feature extraction and analysis.
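
A minimal sketch of this extraction step, assuming tweepy's v1.1 user_timeline endpoint; the credential strings are placeholders, not values from the project:

```python
import tweepy

# Placeholder credentials -- supply your own Twitter Developer app keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_recent_tweets(screen_name, limit=3200):
    """Page through a user's timeline, up to Twitter's 3200-tweet cap."""
    tweets = []
    cursor = tweepy.Cursor(api.user_timeline,
                           screen_name=screen_name,
                           count=200,              # max page size for v1.1
                           tweet_mode="extended")  # full, untruncated text
    for status in cursor.items(limit):
        tweets.append(status._json)                # raw JSON per tweet
    return tweets
```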

Labelled Twitter Account Data Set

A key challenge in the data collection exercise was building a data set with labels classifying users as either bots or humans. After analysing the various available data sources that provide pre-classified accounts, we chose the "Varol-2017" data set from the bot repository of the Botometer project, based on its recency and size. The data set contained a total of 2573 accounts: 826 classified as bots and 1747 classified as humans. This data set was used as the base for training and validating the models.

A similar data set, "cresci-2017", from the same bot repository was used to generate the test data set after filtering down the number of users. It contained a total of 918 users: 392 accounts classified as bots and 526 classified as humans.
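
For illustration, the labelled accounts could be loaded as follows; the file name, separator, and column layout are assumptions about the downloaded repository files rather than confirmed project details:

```python
import pandas as pd

# Assumed layout: one whitespace-separated record per account,
# giving a user ID and a bot/human label.
labels = pd.read_csv("varol-2017.dat", sep=r"\s+", header=None,
                     names=["user_id", "bot_flag"])

# Sanity check against the counts reported above (826 bots, 1747 humans).
print(labels["bot_flag"].value_counts())
```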

Data Preprocessing

  • The raw tweet object for each user extracted using the Twitter Developer API is in JSON format and carries a large amount of user-level metadata and tweet-level information. Refer to the link associated with the tweet object for the complete information made available as part of the tweet feeds.
  • Since not all of this metadata is relevant to our objective, the first step was to understand the various meta information available in the JSON feed and classify it into categories such as user-related information, tweet-level information, entity information used in tweets, and geographic information used or referenced.
  • Based on this understanding, the metadata and features to be extracted for use in our models were shortlisted. The shortlist contained both user-level and tweet-level information.
  • Once the meta information to use for classification and analysis was shortlisted, the JSON feeds for the tweet data were parsed to extract the identified metadata (a parsing sketch follows this list).
  • The raw attributes extracted from the JSON feeds were then cleansed and processed into the following categories of feature sets:
    • User Metadata
      • Focuses on all the metadata associated with the individual user account
    • Friends and Network Analysis
      • Focuses on the followers of the user account and the accounts it follows, along with additional factors such as membership in public lists.
    • Timing Analysis
      • Focuses on the user account's frequency of tweeting, etc.
    • Sentiment Analysis
      • Focuses on calculating the subjectivity and polarity of the tweet texts created by the user.
    • Content/Topic Analysis
      • Focuses on tweet-level metadata such as the types and average counts of entities linked in a tweet, and the type of tweet message (original, quote, or retweet), etc.
      • Groups the tweets created by the user into a set of topics.
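
A minimal sketch of the parsing step referenced above, assuming each raw tweet is available as a Python dict with the standard v1.1 tweet object fields; the function name and the exact selection of fields are illustrative:

```python
def extract_tweet_features(tweet):
    """Pull shortlisted tweet-level attributes out of one raw
    tweet JSON object (a dict with v1.1 tweet fields)."""
    entities = tweet.get("entities", {})
    return {
        "is_reply": int(tweet.get("in_reply_to_status_id") is not None),
        "is_quote_status": int(tweet.get("is_quote_status", False)),
        "is_retweet": int("retweeted_status" in tweet),
        "favorite_count": tweet.get("favorite_count", 0),
        "retweet_count": tweet.get("retweet_count", 0),
        "hashtags_count": len(entities.get("hashtags", [])),
        "urls_count": len(entities.get("urls", [])),
        "symbols_count": len(entities.get("symbols", [])),  # cashtags
        "user_mentions_count": len(entities.get("user_mentions", [])),
        "media_count": len(entities.get("media", [])),
        "place_reference": int(tweet.get("place") is not None),
        "possibly_sensitive": int(tweet.get("possibly_sensitive", False)),
    }
```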

Feature Extraction

Based on the analysis and shortlisting done in the data preprocessing stage, the following features were extracted as part of the data set for exploratory data analysis.

The features are bucketed under the feature categories mentioned above.

  • User Metadata
    1. User ID - ID of the Twitter user
    2. Bot Flag - Response classification flag indicating whether the user is a human (1) or a bot (0)
    3. Default Profile - Flag indicating whether the user still has the default profile assigned at account creation. 1 for default and 0 if updated.
    4. Default Profile Image - Flag indicating whether the user still has the default profile image assigned at account creation. 1 for default and 0 if updated.
    5. Favorites Count - Number of tweets the user has liked over the account's lifetime
    6. Statuses Count - Number of tweets (including retweets) issued by the user
    7. Geo Enabled - Flag indicating whether the user has enabled geographic tagging of their tweets. 1 for yes and 0 for no.
    8. Protected - Flag indicating whether the user has chosen to protect their tweets. 1 for yes and 0 for no.
    9. Verified - Flag indicating that the user's account has been verified by Twitter. 1 for yes and 0 for no.
    10. Has Extended Profile - Flag indicating that the user has an extended profile. 1 for yes and 0 for no.
    11. Is Translator - Flag indicating that the user is a participant in Twitter's translator community. 1 for yes and 0 for no.
    12. User Language - The BCP 47 language identifier of the user account. One-hot encoded attribute.
  • Friends and Network Analysis
    1. Followers Count - Number of followers the account has at the current time
    2. Friends Count - Number of users this account is following
    3. Listed Count - Number of public lists that this user is a member of
  • Timing Analysis
    1. Frequency of Tweets - Frequency of tweets generated by this user per hour.
  • Sentiment Analysis (sentiment analysis was performed using the TextBlob library; see the sketch after this feature list)
    1. Polarity Score - Gives the sentiment orientation (negative, neutral, or positive) of the text message, from -1 for extremely negative to +1 for extremely positive.
    2. Subjectivity Score - Identifies whether the text message is objective (a fact) or subjective (an opinion), from 0 for very objective to 1 for very subjective.
  • Topic Modeling (topic analysis was performed using the spacy and nltk libraries)
    1. Hash Tags - All the topics discussed in the user's tweet messages
  • Content/Topic Analysis
    1. Place Reference - Flag indicating if the tweet content has references to a geographic place. 1 for yes and 0 for no.
    2. Possibly Sensitive - Flag indicating if the tweet content has potentially sensitive information based on Twitter's analysis. 1 for yes and 0 for no.
    3. Is a Reply - Flag indicating if the tweet is a reply to another tweet. 1 for yes and 0 for no.
    4. Is Quote Status - Flag indicating if the tweet is a Quoted tweet. 1 for yes and 0 for no.
    5. Is a Retweet - Flag indicating if the tweet is a retweet of another tweet. 1 for yes and 0 for no.
    6. Favorite Count - Number of times a tweet has been liked by other Twitter users.
    7. Retweet Count - Number of times this tweet has been retweeted by other Twitter users.
    8. Hash Tags Count - Number of Hash Tags (Topics) referenced in a tweet message.
    9. Media Count - Number of media (image, video, etc.) attachments used in a tweet message.
    10. URL References Count - Number of URL references made in a tweet message.
    11. Symbols Count - Number of cashtags used in a tweet message.
    12. User Reference Count - Number of twitter user references made in a tweet message.
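
A minimal sketch of the sentiment and timing features, assuming TextBlob's default analyzer and defining tweet frequency as tweets per hour over the span of the fetched timeline; the helper names and the exact frequency definition are assumptions, not confirmed project details:

```python
from textblob import TextBlob

def sentiment_scores(text):
    """Polarity in [-1, 1] and subjectivity in [0, 1] for one tweet text."""
    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity

def tweets_per_hour(created_ats):
    """Average tweets per hour over the window covered by the fetched
    timeline. `created_ats` is a list of datetime objects."""
    hours = (max(created_ats) - min(created_ats)).total_seconds() / 3600
    return len(created_ats) / hours if hours > 0 else float(len(created_ats))
```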

Feature Consolidation

The features extracted were at both the user level and the tweet level, and there can be up to 3200 tweets per user given our extraction limits. The features therefore needed to be consolidated to the user level. To achieve this, the tweet-level features were aggregated for each user, taking an average in most cases, which gives an indication of that user's tweeting pattern.
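
A minimal sketch of this aggregation, assuming the tweet-level features sit in a pandas DataFrame tweet_df with one row per tweet and a user_id column (both names are illustrative):

```python
import pandas as pd

# Average each numeric tweet-level feature per user, e.g. the average
# number of hashtags or URL references per tweet.
tweet_features = (tweet_df.groupby("user_id")
                          .mean(numeric_only=True)
                          .add_prefix("avg_")
                          .reset_index())
```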

A maximum of 3200 tweets per user may not be a large enough sample for a complete time-series analysis, but we were limited by the constraints imposed by the standard Twitter developer accounts used for data extraction. A few features and attributes of interest (such as the number of times a tweet was quoted, the number of times a tweet was replied to, the number of polls referenced in tweet messages, matching rules that provide the nature of topics, etc.) were likewise limited and available only to premium subscription accounts.

The aggregated tweet-level features were then consolidated with the user-level features to form the baseline data set for exploratory data analysis, corrections, updates, and eventually model training on a final set of features.
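
Continuing the sketch above, and assuming the user-level features sit in a DataFrame user_df keyed by the same user_id column:

```python
# One row per user: user-level metadata joined with the per-user
# averages of the tweet-level features.
baseline = user_df.merge(tweet_features, on="user_id", how="left")
```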