Exploratory Data Analysis

A detailed analysis was performed to explore and understand the distribution of the extracted features and the relationships between them. This analysis was required to determine the importance of each feature, any updates or changes that might be required, features that might have to be merged or split, and features that need to be dropped.

Feature Relationship Analysis

To understand the relationships among the numerical features, multiple pair plots and heat maps were generated. Pair plots were generated per category to understand the relationships between features within that category. The two primary categories considered were user-level features and tweet-level features. A heat map of the correlation scores across all numerical features in both categories was also generated to infer the relationships and collinearity between the features identified.
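The pairwise analysis described above can be sketched as follows. This is a minimal illustration on synthetic data; the column names are placeholders standing in for the real user-level and tweet-level numeric features.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names are hypothetical stand-ins for
# the user-level and tweet-level numeric features described above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "followers_cnt": rng.integers(0, 5000, 200),
    "friends_cnt":   rng.integers(0, 5000, 200),
    "url_ref_cnt":   rng.integers(0, 10, 200),
    "hash_tag_cnt":  rng.integers(0, 10, 200),
})

# Pearson correlation matrix across all numeric features
corr = df.corr()

# Flag any highly collinear off-diagonal pairs (|r| > 0.8)
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]
print(high)

# The corresponding visuals (requires seaborn):
# import seaborn as sns
# sns.pairplot(df)
# sns.heatmap(corr, annot=True)
```

An empty `high` list corresponds to the "no strong collinearity" conclusion drawn from the heat map.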

Pair Plot of Numeric features at the User Level

Pair Plot of Numeric Features at Tweet Level

Heat Map of Correlation Scores of the Numeric Features across the User and Tweet Levels

From the above relationship plots and the heat map generated on the correlation matrix, we can say that the predictor variables are not highly correlated in most cases, so multicollinearity should not be a problem.

Feature Distribution Analysis across Human and Bot classes

To understand how the features are distributed across the human and bot classes, a series of distribution plots was generated and analysed. These plots show how the data is distributed differently between the human and bot classes, which helps assess whether a feature would play an effective role in training the classification models. The distribution plots were again grouped into user-level and tweet-level features.
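A per-class comparison of this kind can be sketched as below. The frame and column names (`account_type`, `url_ref_cnt`) are hypothetical stand-ins for the real engineered features.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the engineered feature set; the column
# names are placeholders, not the actual extracted schema.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "account_type": rng.choice(["human", "bot"], 300),
    "url_ref_cnt":  rng.poisson(2, 300),
})

# Compare the per-class distributions of a feature numerically ...
summary = df.groupby("account_type")["url_ref_cnt"].describe()
print(summary)

# ... or visually, one overlaid density per class (requires seaborn):
# import seaborn as sns
# sns.histplot(data=df, x="url_ref_cnt", hue="account_type", stat="density")
```

Repeating this per feature is what produces the series of class-conditional distribution plots discussed in this section.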

Distribution of Tweet Level Features across Human and Bot users

The above set of distribution plots clearly shows varying patterns in the way media, URLs, hashtags and user references are used in tweet text messages.

Distribution of Sentiment Analysis scores across tweet messages generated by Human and Bot users

The distribution plots of the sentiment analysis scores, in terms of the subjectivity and the emotional (polarity) level of the messages generated, also show a difference in the messaging patterns of the tweets produced by these two classes of users.

Distribution of Tweeting Frequency of Human and Bot Users

Although a large difference in tweeting frequency between the human and bot users was expected, the distribution of tweeting frequency in hours shows a fairly similar pattern between the two classes. We are nevertheless retaining this feature, as this may be an anomaly of the users under study, and the expectation is that it would play a key role in real-world or test data sets.

Distribution of User Level Features across Human and Bot users

The distribution plots of the user-level features show differences in the follower and friend patterns between human and bot users. This was expected, as bot users tend to have smaller networks in terms of friends and followers. The distribution of the number of tweets a user has liked in their lifetime also shows a significant difference between bots and humans, in the expected direction: most bot users have never liked a tweet. The distribution of the number of tweets (statuses) generated shows a broadly similar pattern across the two classes.

Distribution of categorical features across bot and human users

A series of distribution plots was generated for the categorical or flag-type features at both the user and tweet levels. The plots showed that many of these features take almost a single value for the majority of the data set, with very little spread across categories. This leads us to believe that these features are unlikely to contribute to model effectiveness and can be removed from our final data set. Based on this observation, the features protected, verified, is translator, possibly sensitive, place reference, default profile image, default_profile and symbols_cnt will be removed from the final data set.
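The check-and-drop step can be sketched as follows; the toy frame and the shortened drop list are illustrative, with the full list of dropped columns given in the text above.

```python
import pandas as pd

# Hypothetical slice of the feature table; values are illustrative.
df = pd.DataFrame({
    "protected":   [False] * 99 + [True],
    "verified":    [False] * 100,
    "url_ref_cnt": list(range(100)),
})

# Flag-type features dominated by a single value (shortened version of
# the drop list identified in the text).
drop_cols = ["protected", "verified"]

# Why they are dropped: the most common value covers almost all rows.
for c in drop_cols:
    top_share = df[c].value_counts(normalize=True).iloc[0]
    print(c, round(top_share, 2))

df = df.drop(columns=drop_cols)
print(df.columns.tolist())
```

The same `value_counts(normalize=True)` check is a quick numeric proxy for the categorical distribution plots described above.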

The distribution of the language code (BCP 47) used in the user's account settings was analyzed to see whether it could be one-hot encoded as a categorical feature and whether it would show differences between the bot and human users. However, the distribution analysis shows that almost all users in both classes use English as their language code. So the user_lang feature is unlikely to prove effective for the classification models and will be removed from the final data set.

Feature Ranking

A set of features has been updated or removed based on our observations from the EDA exercise. Next, we evaluate the importance of the remaining predictors using the ExtraTreesClassifier technique. This classifier is an ensemble method that fits multiple randomized decision trees on the training data and reports the relative importance of each predictor for the classification task. The following plot shows the relative importance score of every predictor in the data set and how much significance each would have in the classification process.
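A minimal sketch of this ranking step, on synthetic data (the feature names and label construction are placeholders, chosen so that one feature is informative by design):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training data; feature names are hypothetical.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "url_ref_cnt":   rng.poisson(2, 500),
    "followers_cnt": rng.integers(0, 5000, 500),
    "noise":         rng.normal(size=500),
})
# The label is made to depend on url_ref_cnt, so that feature should
# earn the highest importance score.
y = (X["url_ref_cnt"] > 2).astype(int)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# importances.sort_values().plot.barh()  # bar plot like the one shown below
```

The `feature_importances_` attribute (impurity-based, summing to 1 across predictors) is what the relative-importance plot visualizes.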

The importance ranking scores observed are mostly inline with our expectations. For example,

    • The url_ref_cnt feature indicates the number of URL references included in a message. Based on our observations, bot users tend to include many URL references in their messages, so this feature plays a larger role in the classification process.
    • The same is true of features like is_a_reply and is_a_retweet; human users are expected to perform these actions more often than bot users.

Feature Importance prediction using Extra Trees Classifier

Based on the above plot, we can be fairly confident that all the predictors in the data set are relevant to the classification process and that none of them need to be removed.

Final Feature Set

Based on all the relationship and distribution analysis performed so far, the following predictor features will be used to train and validate our classification models.

The final set of predictors identified will be scaled and normalized to a standard range before fitting and training the models.
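One common way to do this (an assumption; the text does not name the scaler) is standardization with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder matrix of final predictors (rows = accounts/tweets).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Fit on training data only, then reuse the fitted scaler on the
# validation/test data so all splits follow the same standard range.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.mean(axis=0))  # approximately 0 per column
print(X_scaled.std(axis=0))   # approximately 1 per column
```

Fitting the scaler only on the training split avoids leaking test-set statistics into the model.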

Topic Modeling Analysis

To perform topic-based modeling, all the topics discussed in the tweet messages of both bot and human users were extracted, and a consolidated feature, "hash_tags", was formed during the data extraction and processing phase. This set of topics extracted from the tweet data will be used to perform classification based on a topic-modeling approach.
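The extraction and frequency-counting steps can be sketched as below; the sample tweets and the regex are illustrative, not the original pipeline's exact code.

```python
import re
from collections import Counter

# Toy tweets; in the real pipeline the extracted tags were consolidated
# into the "hash_tags" feature during data processing.
tweets = [
    "Throwback! #tbt #love",
    "Follow these accounts #ff",
    "#love this song #nowplaying",
]

# Pull hashtag tokens out of each message, lowercased for consistency.
hash_tags = [tag.lower() for t in tweets for tag in re.findall(r"#(\w+)", t)]

# The most frequently discussed topics, as visualized per class below.
print(Counter(hash_tags).most_common(3))
```

Running this separately over the human and bot tweet sets yields the per-class topic-frequency plots discussed next.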

The following plots show the most commonly discussed topics in the tweet messages of the human and bot user classes.

Comparing the two plots, many of the most frequent topics differ between the bot and human user classes. There are some similarities, such as tbt (Throwback Thursday), rt (retweet) and ff (Follow Friday), but these are expected given the popularity of those topics on Twitter. The many unique topics, however, should help with effective topic-based classification modeling. On a lighter note, who knew that "love" would be a frequently discussed topic among bots as well :)