Machine Learning Final Project

By Daniel Neel and Alexander Williams

How to solicit goodwill: A machine learning investigation into altruism

Background

For this investigation, we trained various machine learning models to predict the success of, and public sentiment toward, altruistic requests. The data we used for testing came from the "Reddit Pizza Requests" dataset provided by SNAP, which contains metadata from 5,671 Reddit posts in the "Random Acts of Pizza" (RAOP) subreddit. Each post is stored as a JSON object, with attributes such as votes, message bodies, time of request, success, and many others. Because of the JSON formatting and the breadth of information contained, this dataset could be efficiently parsed and fed into machine learning algorithms in Python.

The case study “How to Ask for a Favor” from Althoff and colleagues examines this data, extracting social features and modeling a predictor of success via logistic regression. Their work was quite helpful in determining the more influential elements behind a poster successfully receiving food from a generous donor [1]. While our early steps attempted to reproduce these findings and expand them to other machine learning models for comparison, the broader idea behind our study was the analysis and prediction of public reception toward an altruistic appeal. That is, we aimed to predict how positively an altruistic request would be viewed by the public, which we believe may ultimately be a more generally applicable view of altruistic modeling. For example, information on factors such as timing and wording could be valuable for any organization or individual setting up a fundraiser or charity.

Additionally, Maguire and Michelson conducted a multi-model analysis predicting the popularity of online social news posts from HackerNews. Though the target of their evaluation is not altruistic in nature, their tools and intentions were certainly comparable, since general popularity is a component of public response [2]. Their successes with multiple learning models helped us affirm our strategy and consider potential key factors in our own dataset. The Greenberg study also suggested that a bandwagon-style effect may skew reception of an altruistic request, implying that popularity may be a complicated and circumstantial factor that machine learning techniques need to consider within the context of other features [3]. This consideration applies in the tests where pruning was used, which we define as excluding examples with few total votes.

The above video describes how information was pruned and formatted. Testing is also covered, and a brief summary of the results follows the video link.

Model Construction

The RAOP data was converted from its base JSON format to a Python dict where each key is the name of a trait and the value is an array holding that trait for all examples, preserving their consistent original ordering.
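A minimal sketch of this conversion, assuming the dataset file is a single JSON list of post objects (the file name below is an assumption):

import json

def load_raop_dataset(path="pizza_request_dataset.json"):
    # Load the RAOP JSON file (a list of post objects) and pivot it into a
    # dict of parallel lists: one key per trait, one entry per post,
    # preserving the original post ordering.
    with open(path, "r", encoding="utf-8") as f:
        posts = json.load(f)
    return {key: [post.get(key) for post in posts] for key in posts[0]}

# Every value array lines up with the others, so data[trait][i] always refers to post i.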

All models are supervised, founded on course material, and written using Python 3 with the help of efficient NumPy operations. Decision trees are formed by greedily choosing the best decision rule: at each node, the remaining binary feature with the highest gain is selected, splitting the current node into two subsets, and training recurses until reaching a predefined maximum depth. Naive Bayes produces probabilistic classifications by treating all features as independent and predicting the most likely label from maximum likelihood estimates. Perceptron modeling is online, processing and classifying one example at a time and updating its weights upon a misclassification.
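As an illustration of the greedy split selection, the sketch below scores boolean features by information gain; the use of Shannon entropy as the gain measure is our assumption for this example.

import numpy as np

def entropy(labels):
    # Shannon entropy of a 0/1 label array.
    if len(labels) == 0:
        return 0.0
    p = float(np.mean(labels))
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_split(X, y):
    # Pick the boolean feature (column of X) with the highest information
    # gain for labels y; returns (feature_index, gain).
    base = entropy(y)
    best_idx, best_gain = None, 0.0
    for j in range(X.shape[1]):
        mask = X[:, j].astype(bool)
        n_true, n = mask.sum(), len(y)
        if n_true == 0 or n_true == n:
            continue  # this feature does not separate the node at all
        gain = base - (n_true / n) * entropy(y[mask]) - ((n - n_true) / n) * entropy(y[~mask])
        if gain > best_gain:
            best_idx, best_gain = j, gain
    return best_idx, best_gain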

For the Decision Tree and Naive Bayes methods, 100 boolean feature arrays were established. These covered the presence of particular keywords in the body or title of the post, which were included only if they met a proportional threshold of appearances across the given dataset. The Unix timestamp included with each post allowed various time features, including time of day, day of week, day of month, month, year, and (after some specific conditionals) proximity to holidays. Boolean features were also established from the relative length of each message's title and body, attributes of the requester's account such as age and previous activity, and the inclusion of an image within the post (detected by searching for extensions such as ".jpg", ".png", and ".gif"), among others.
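A rough sketch of how such keyword and image features might be built; the 5% appearance threshold and the helper names are illustrative assumptions rather than the project's exact values.

import numpy as np

KEYWORD_THRESHOLD = 0.05          # word must appear in at least 5% of posts (assumed value)
IMAGE_EXTENSIONS = (".jpg", ".png", ".gif")

def keyword_features(texts):
    # Build one boolean array per frequent word, marking whether that word
    # appears in each post's combined title and body text.
    tokenized = [set(t.lower().split()) for t in texts]
    counts = {}
    for words in tokenized:
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    frequent = [w for w, c in counts.items() if c / len(texts) >= KEYWORD_THRESHOLD]
    return {w: np.array([w in words for words in tokenized]) for w in frequent}

def has_image(texts):
    # Boolean feature: the post links to an image file.
    return np.array([any(ext in t.lower() for ext in IMAGE_EXTENSIONS) for t in texts])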

For logistic regression and perceptron modeling, data could be taken more directly from the dataset and represented in floating point format. Rather than relative categories, we could use actual word counts of message bodies and titles, specific temporal data, and ratios given as fractions rather than independent boolean classes. We also took a cue from Althoff et al. [1] here, tallying the number of words falling within predefined narratives: categorizations of the reasoning behind the post, such as 'money', 'family', or 'school'. These values were then normalized into more condensed ranges.
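The narrative tallies and normalization could look roughly like the following; the lexicons shown are placeholders, and min-max scaling is one assumed way to condense the ranges.

import numpy as np

NARRATIVES = {
    "money":  {"money", "broke", "paycheck", "rent", "bills"},
    "family": {"family", "wife", "husband", "kids", "daughter", "son"},
    "school": {"school", "college", "student", "finals", "semester"},
}

def narrative_counts(text):
    # Count how many words of the post fall into each narrative category.
    words = text.lower().split()
    return [sum(w in lexicon for w in words) for lexicon in NARRATIVES.values()]

def min_max_normalize(column):
    # Squash a floating point feature column into the [0, 1] range.
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    return (column - column.min()) / span if span > 0 else np.zeros_like(column)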

(For multiclass reception prediction, features such as the number of upvotes, downvotes, and comments had to be omitted to prevent a strong dependence between feature and label. Because traits such as these arise only after the publication of a request, and thus reflect the outcome rather than the request itself, they would invalidate results by implying the classification to an unfair degree.)

Additionally, multiple factors were considered when balancing data for modeling. Because the success-to-failure ratio of the basic dataset is 1:3, a random subset of the failing requests was omitted from our train and test data in a given run of binary classification. Balancing class distributions in multiclass modeling was less trivial: we created an ascending list of the upvote ratios for all included examples and found the values at the 1/3 and 2/3 breakpoints (or the 1/4, 1/2, and 3/4 breakpoints, depending on the number of classes). However, the class counts were often still somewhat skewed, especially when not pruning data, because many examples may share the same ratio. To counter this, we created multiple distributions with differing conditionals at the chosen breakpoints and then calculated the standard deviation between class counts to find the least skewed distribution.
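A sketch of that balancing step, assuming quantile-based cut points and just two candidate comparison rules (the actual code tried several conditional variants):

import numpy as np

def balanced_classes(ratios, n_classes=3):
    # Assign each example a class from its upvote ratio so that class counts
    # are as even as possible. Candidate cut points come from quantiles of the
    # sorted ratios; strict and non-strict comparisons are both tried, and the
    # assignment whose class counts have the lowest standard deviation wins.
    ratios = np.asarray(ratios, dtype=float)
    cuts = np.quantile(ratios, [k / n_classes for k in range(1, n_classes)])

    candidates = []
    for strict in (True, False):
        labels = np.zeros(len(ratios), dtype=int)
        for cut in cuts:
            labels += (ratios > cut) if strict else (ratios >= cut)
        candidates.append(labels)

    counts = [np.bincount(c, minlength=n_classes) for c in candidates]
    best = int(np.argmin([np.std(c) for c in counts]))
    return candidates[best], cuts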

Model Testing Results

The results below show the averages from 23 test runs of different machine learning models, with randomized partitions between training and testing examples under a consistent 80/20 split. There are three distinct testing groups shown and two different outcomes measured from the "Reddit Pizza Requests" dataset from SNAP.
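Each run drew a fresh random 80/20 partition along roughly these lines (the helper is an illustrative sketch, not the project's exact partitioning code):

import numpy as np

def random_split(n_examples, train_frac=0.8, seed=None):
    # Randomly partition example indices into training and testing sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    cut = int(train_frac * n_examples)
    return idx[:cut], idx[cut:]

# Accuracies from repeated runs (23 in our tests) are then averaged, e.g.:
#   scores = [evaluate(*random_split(n)) for _ in range(23)]   # evaluate() is a placeholder
#   mean_accuracy = np.mean(scores)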

Bars shown in blue represent non-pruned data: for multiclass models, this is the entire dataset; for binary classification, it includes the full suite of successful requests and an equally sized random subset of failing requests.

The results shown in green represent the runs with the most effective pruning factor. A pruned test excluded examples whose total vote count (not net up/down vote score) fell below a threshold, which ranged anywhere from 3 to 11 Reddit votes depending on the test (specific threshold numbers are labeled in the presentation and spreadsheet).
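In code, this pruning amounts to a simple mask over each post's total vote count, roughly as below (pulling the vote fields out of the dataset is assumed to happen elsewhere):

import numpy as np

def prune_by_votes(upvotes, downvotes, threshold):
    # Keep only posts whose *total* vote count (ups + downs, not the net score)
    # meets the pruning threshold; returns a boolean mask over the examples.
    total = np.asarray(upvotes) + np.asarray(downvotes)
    return total >= threshold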

Model abbreviations are as follows: NB refers to Naive Bayes, DT to decision tree, Log to logistic regression, and Per to the Perceptron model. See below the images for the spreadsheet information that these graphs were derived from.

Result_Averages.xlsx

Binary Classification

The binary classification results are the tests that attempted to predict the success of a pizza request. This is the goal investigated by Althoff et al. [1], so it acted foremost as a proof of concept for our implemented models and an affirmation that we could reproduce their results to a reasonable degree. With the exception of the perceptron model, each test performed appreciably better than random chance, at around 70% accuracy. Pruning actually appeared to slightly hinder results in this case, probably because low vote count posts possess features that correlate strongly with failure, so the models were losing valuable data that could reinforce negative predictions.

Decision tree modeling succumbed to overfitting quite easily, generally providing the most accurate results with maximum depths of 4 or less. Additionally, we found that decision trees outperformed Naive Bayes on average in this success/fail prediction and often matched logistic regression's best case, showing that simplicity may possess an edge in this arena. That is, it appears that a fairly small subset of factors corresponds heavily to success or failure relative to the rest. However, this subreddit provides a very focused and possibly specialized type of altruistic study, so carrying a model from here over to a general charity would likely suffer from a form of overfitting itself.

Additionally, we implemented code to display the positive correlation of the more active boolean features (post-pruning) by calculating the ratio of each feature's presence (being True) on successful requests to its overall presence (times it is True, regardless of success or failure). Conversely, we do the same for negative correlation, using each feature's absence on failures relative to its total absences. These values were compared against a chosen baseline value and printed to show their relative influence, with strongly positive and strongly negative values implying more determinant features. An example output with no pruning can be seen at https://sites.google.com/vt.edu/fall-2017-cs5824-group-5/home/correlations
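A sketch of that correlation check, with a 0.5 baseline assumed purely for illustration:

import numpy as np

def feature_correlations(feature, success, baseline=0.5):
    # positive: fraction of the feature's True occurrences that were successes
    # negative: fraction of the feature's False occurrences that were failures
    # Values well above the baseline in either direction suggest a more
    # determinant feature.
    feature = np.asarray(feature, dtype=bool)
    success = np.asarray(success, dtype=bool)
    positive = (feature & success).sum() / max(feature.sum(), 1)
    negative = (~feature & ~success).sum() / max((~feature).sum(), 1)
    return positive - baseline, negative - baseline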

Multiple Outcome Classification

For 3- and 4-class classification, models were designed to predict the relative ratio of upvotes to total votes for a given request. Under the distribution practice explained earlier, this could mean a model has fairly uniform classes representing negative response, neutral response, and positive response, based on requests with under 50% upvotes, with 50% to 75% upvotes, and with greater than 75% upvotes. Or, these ratios could be divided at 40% and 62%, or perhaps 65% and 80%: the class counts determine the delimiters, not the other way around.

In these cases, pruning did consistently improve results, understandably because a small number of votes can easily skew the ratio for a given example. A post with a single upvote would be classified as "very positive," despite our intuition that one vote is a very weak basis for understanding broader public reception. Unlike before, Naive Bayes modeling pulled ahead of decision trees under equivalent pruning sets: pruning removes the entries with conflicting features that otherwise negatively impact NB, while those same features are generally omitted from the DT models because the tree hits maximum depth before their gain becomes relevant.

Most models in these test runs outperformed random chance guessing to a notable degree, demonstrating some correlation between features and response and furthering the justification to study this concept on a larger scale. In particular, the best case results for logistic regression averaged 49% for 3-class prediction (random ~33%) and 38% for 4-class prediction (random ~25%). These results are not surprising, insofar as the comparably more complex model is often expected to outperform the simpler ones. That said, we believe that a more involved feature selection process, one with a clever analysis of request wording, could further logistic regression's advantage in this area. (Logistic regression in our tests is also prone to slight overfitting on the training data, so additional work could be done to alleviate that.)

Note: Again, the perceptron turned out to be the weakest, and these numbers are closer to random chance than a reasonable model should allow. The perceptron classification result represents the average taken across all iterations, as our model returns very erratic accuracies that imply a fault on our end. The perceptron model was a late-stage addition that we believe could be implemented better given more time and different care in data handling.

Project Files

Below are links to our Python files and notebooks. The notebooks can be run, provided the files are in the assumed local directories and the system has the necessary packages installed (e.g., NumPy).

Boolean variables near the tops of the notebooks flip options such as the multiclass count (3 or 4), the inclusion of certain features such as vote totals and comment totals, the use of boolean arrays versus floating point arrays from the dataset, and the use of principal component analysis. Additionally, the 'tr' value represents the trim, forcing the code to eliminate entries below the given threshold of votes.
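As a hypothetical illustration of such a configuration cell (apart from 'tr', the variable names here are invented and may not match the notebooks):

MULTICLASS_COUNT = 3          # 3- or 4-class reception prediction
INCLUDE_VOTE_TOTALS = False   # omit post-publication counts for reception tests
USE_BOOLEAN_FEATURES = True   # boolean feature arrays vs. floating point arrays
USE_PCA = False               # apply principal component analysis to the features
tr = 5                        # pruning threshold: drop entries with fewer total votes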

bayes_dt_binary -> Decision tree and Naive Bayes modeling for binary classification (with exclusively boolean array features)

bayes_dt_multi -> Decision tree and Naive Bayes modeling for multiclass prediction

bayes_multi_auto -> Decision tree and Naive Bayes modeling for multiclass prediction, set to randomize data and accumulate accuracies across a given loop value and repeat for incrementing 'trim' pruning thresholds

perc-log* -> Corresponds to the above files, with logistic regression and perceptron models using floating point arrays (or boolean, if chosen)

read_dataset.py -> Reads the json pizza request dataset and breaks into two dicts representing our floating point and boolean feature sets

functions.py -> Various functions to facilitate operation of our code, from performing pruning to normalization to partitioning the training and testing subsets

decision_tree/naive_bayes/linearclassifier -> Define the core of the training and testing algorithms for associated models

REFERENCED STUDIES

[1] Althoff et al., “How to Ask for a Favor: A Case Study on the Success of Altruistic Requests”. Stanford University. 2014. https://web.stanford.edu/~jurafsky/pubs/icwsm2014_pizza.pdf

[2] Maguire, Joe and Michelson, Scott. “Predicting the Popularity of Social News Posts”. Stanford University. 2012. http://cs229.stanford.edu/proj2012/MaguireMichelson-PredictingThePopularityOfSocialNewsPosts.pdf

[3] Greenberg et al., “Crowdfunding support tools: predicting success & failure”. CHI '13 Extended Abstracts on Human Factors in Computing Systems (CHI EA '13), pp. 1815-1820, May 2013. https://dl.acm.org/citation.cfm?id=2468682