a project by T.J. Gaffney and Dan Smith.
There's a subreddit (a community on the website Reddit) called r/borrow where users request small, short-term, high-yield loans. These loans are considered high risk; there is no collateral or way to pursue defaulted loans. However, with the high yield, there is an opportunity to lend profitably; some high-profile lenders have managed to make good money. The problem is interesting in part because there is a lot of data available: The request for a loan comes with a short message, and we have a full history on the user, including their borrowing history.
We set out to model the probability of default, which we could then use to decide which loans to make. This page discusses our approach. The project breaks down into three major pieces: scraping data, modeling, and implementing.
Partial code is available here.
We took advantage of the fact that every Reddit webpage has a JSON version, and pulled these. Most of the code available above deals with parsing, in the helper libraries.
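As a sketch of what "pulling the JSON version" means (the helper name here is ours, not from the project's actual code): Reddit serves the JSON form of a page when you append `.json` to its path, so the scraper just needs to rewrite each URL before fetching it.

```python
# Hypothetical helper: Reddit returns JSON for any page URL with ".json"
# appended to the path. This only builds the URL; an HTTP GET on the
# result would return the page's data as JSON.

def json_url(page_url):
    """Return the JSON endpoint for a Reddit page URL."""
    return page_url.rstrip("/") + ".json"

# json_url("https://www.reddit.com/r/borrow/comments/abc123/")
#   -> "https://www.reddit.com/r/borrow/comments/abc123.json"
```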
From the loan request, we parse things like title, description, borrower, amount requested, amount to repay, repayment date, currency, country, and payment medium (PayPal, Venmo, etc.). We did a lot of work to nail down the borrower's loan history. Every loan should have a follow-up post marking it paid, unpaid, or paid late. We looked for these follow-up posts, which sometimes didn't exist, matched each one to its loan request, which was sometimes difficult, and pulled the outcome along with the comments and heading. This matching was difficult but very important: historical repayment was the most predictive variable, and early on we discovered missed opportunities stemming from mislabeled data. The subreddit had its own bot that attempted the same thing and made a post on each request. We pulled this as well, and built some rules around how to reconcile disagreements between us and the bot.
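To illustrate the shape of that reconciliation (the function and the precedence rules below are a hypothetical sketch, not the project's exact rules), the logic amounts to a small decision function over the two labels:

```python
# Hypothetical sketch of reconciling our parsed outcome with the bot's
# label for the same loan. The precedence here is illustrative: prefer
# agreement, fall back to whichever source has a label, and treat any
# "unpaid" signal as the label to keep since it is the riskier call.

def reconcile(ours, bots):
    """Pick a final repayment label from our parse and the bot's post."""
    if ours == bots:
        return ours
    if ours is None:          # we found no follow-up post
        return bots
    if bots is None:          # the bot made no determination
        return ours
    if "unpaid" in (ours, bots):
        return "unpaid"
    return ours               # otherwise trust our own parse
```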
We pulled historical, Reddit-wide data about the user making the request. As a first pass we pulled things like age of account, total number of comments, and total karma (a measure of comment quality). These ended up being pretty predictive. In a second pass, we broke down comment and karma history by subreddit. [See the clusters section below.]
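The first-pass features are simple aggregates over the user's history. A minimal sketch, assuming a comment is a (subreddit, karma) pair (the field and function names are ours, not the project's):

```python
# Hypothetical sketch of the first-pass user features: account age,
# comment count, and total karma, computed from a comment history.
import datetime as dt

def user_features(account_created, comments, now=None):
    """Flat feature dict from an account creation time and a list of
    (subreddit, karma) comment records."""
    now = now or dt.datetime.utcnow()
    return {
        "account_age_days": (now - account_created).days,
        "total_comments": len(comments),
        "total_karma": sum(karma for _, karma in comments),
    }
```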
One problem we had early on was that the program ran far too slowly. We had saved local copies of the Reddit page for each loan, and originally we had a lookup table that listed each loan with its borrower and lender. We found that the vast majority of runtime was spent doing reverse lookups on this table, so we rearranged it into a binary search tree. [The code is available here.] Today I would use a hash map, but this solution sped us up dramatically. [By about 90%, as I recall.]
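The hash-map alternative mentioned above looks like this in Python (a sketch with made-up record fields): build a dict keyed by borrower in one pass, and every later reverse lookup is amortized O(1) instead of a scan over the table.

```python
# Sketch: replace repeated reverse lookups on a (loan, borrower, lender)
# table with a one-pass dict index. Field names are illustrative.

def build_borrower_index(loans):
    """Map borrower -> list of their loan records, in one O(n) pass."""
    index = {}
    for loan in loans:
        index.setdefault(loan["borrower"], []).append(loan)
    return index

loans = [
    {"id": 1, "borrower": "alice", "lender": "bob"},
    {"id": 2, "borrower": "carol", "lender": "bob"},
    {"id": 3, "borrower": "alice", "lender": "dave"},
]
by_borrower = build_borrower_index(loans)
# by_borrower["alice"] now holds loans 1 and 3
```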
We did most of the modeling in R, using XGBoost. We set up cross-validation and used it for parameter tuning. The model quickly overfit, so we ended up using fewer, shallower trees. A lot of the work went into feature engineering: the most predictive variables ended up being user history and payout history, and both of these admit many possible representations.
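The actual tuning ran XGBoost in R; as a language-agnostic illustration of the cross-validation bookkeeping underneath it, here is a k-fold split sketch in Python:

```python
# Sketch of the k-fold split underlying parameter tuning: shuffle the
# row indices once, then hold out every k-th index in turn.
import random

def k_folds(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    for i in range(k):
        held_out = set(idx[i::k])
        yield [j for j in idx if j not in held_out], sorted(held_out)
```

Each candidate parameter setting (tree depth, number of trees, etc.) is scored on the held-out fold of every split, and the setting with the best average held-out score wins.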
We realized in the course of modeling that we had a censorship problem: the lowest-quality loans were mostly unfulfilled, and we couldn't include these in our model, because we modeled whether a loan was repaid (yes/no). We found that a majority of fulfilled loans were repaid, but it would not be correct to conclude that most requests would be repaid. This doesn't strictly invalidate the model, but it does undersample bad risks, and it may almost completely exclude loans with obvious red flags from our data set.
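A toy numeric illustration of the censoring (the numbers below are made up, not our data):

```python
# Suppose 100 requests are posted, only 40 get funded, and 30 of the
# funded loans are repaid. The model only ever sees the funded 40.
requests, funded, repaid = 100, 40, 30

repay_rate_funded = repaid / funded    # what the model sees: 0.75

# The 60 unfunded requests are disproportionately the worst risks.
# If, say, only a quarter of them would have repaid, the true rate
# across all requests is far lower:
repay_rate_all = (repaid + 0.25 * (requests - funded)) / requests  # 0.45
```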
At one point we wanted to mine more data from the user's history, because things like karma and number of posts were very predictive, and our model could tolerate more variables. We decided to break out karma and number of posts by the subreddit where they occurred. Because there are thousands of (common) subreddits, we didn't want to do a full break-out, so we decided to cluster them. After some research we decided to build a complete graph of subreddits, where edge weights are the proportion of subscribers in common, and then break up the graph using a Markov clustering algorithm. [I actually can't remember, and can't find evidence of, whether we implemented this ourselves or used the results of another's analysis posted online.] The resulting clusters were intuitive, for example: obama, history, government in cluster 3, or gameofthrones, harrypotter, doctorwho in cluster 6.
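A minimal sketch of building that graph (the subscriber sets and names below are made up; Markov clustering itself is not shown): the weight on each edge is the fraction of subscribers the two subreddits share.

```python
# Sketch: weighted subreddit graph where each edge weight is the
# proportion of subscribers in common (Jaccard overlap). Subscriber
# sets here are illustrative toys.
from itertools import combinations

subscribers = {
    "obama":       {"u1", "u2", "u3"},
    "history":     {"u2", "u3", "u4"},
    "harrypotter": {"u5", "u6"},
}

def edge_weight(a, b):
    """Fraction of subscribers shared by subreddits a and b."""
    return len(subscribers[a] & subscribers[b]) / len(subscribers[a] | subscribers[b])

graph = {
    (a, b): edge_weight(a, b)
    for a, b in combinations(sorted(subscribers), 2)
    if edge_weight(a, b) > 0
}
# Markov clustering then alternately expands (matrix power) and inflates
# (elementwise power + renormalize) this graph's adjacency matrix until
# it decomposes into clusters.
```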
These variables ended up not being among the most predictive, though that could be because the signal was spread across so many clusters. Still, their inclusion improved the model overall.
We set up a service (available here) that repeatedly searched for new posts, ran our model on them, and emailed us about each one. It sent us emails like this throughout the day:
The emails also showed us the variables that were most important in each prediction: for each one, we showed the delta in the prediction if that variable had not been included in the model. This helped us debug, as well as make decisions about loans.
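A sketch of that per-variable delta (the scoring function below is a made-up stand-in for the real model, and the feature names are illustrative): re-score the loan with each variable replaced by a baseline value, and report how far the prediction moves.

```python
# Sketch of the per-variable "delta" shown in the emails: how much the
# prediction changes if a variable is replaced by a baseline value.

def score(features):
    # Toy stand-in for the real model; weights are illustrative only.
    return 0.5 + 0.3 * features["prior_defaults"] - 0.1 * features["account_age_years"]

def variable_deltas(features, baselines):
    """For each feature, prediction delta vs. scoring with its baseline."""
    full = score(features)
    deltas = {}
    for name, baseline in baselines.items():
        ablated = dict(features, **{name: baseline})
        deltas[name] = full - score(ablated)
    return deltas
```

A large positive or negative delta flags the variable driving the prediction, which is what made these emails useful for spotting both bad loans and bad data.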
We decided that at first we would restrict ourselves to small, low-risk loans. We also learned that PayPal has a route for getting your money back if a service (in this case, repayment) was never delivered. [This didn't work every time.] This wasn't captured in the model, because clawed-back loans wouldn't be marked as repaid, so we made it a rule to only use PayPal.
Results and Conclusions
We implemented our model, only making small loans. After making about a dozen of these, we experienced a much higher default rate than we had anticipated, and lost some money. The defaults we were experiencing were outside the range our cross-validation experiments suggested was reasonable. Here is why we think we were losing money: The loans we were most excited about got filled extremely quickly, and often we couldn't offer to fill them in time. We carefully analyzed the loans we did make, and there must have been borrowers where we missed some red flag; those loans we had no problem filling. We think we were disproportionately getting too many of these.
We decided that we should improve our model before continuing. It would have taken a lot of time to make the necessary improvements, and we had learned that the repeated communication with borrowers and the evaluation of loans already took a long time. In retrospect, it's not surprising that making money at this would take a lot of work. At the same time, we were getting busier with work and other projects, so we put this on hold indefinitely.
If we were to continue this, we would build tools to get to loans quicker, and we would also capture time-to-fill historically, so that we could account for that in our cross-validations.
We would also leverage existing tools. In the course of building this out, we found some loan statistics on sites like redditloans.com and some user statistics on sites like redditinvestigator.com. We even eventually found a blacklist of borrowers shared by some lenders. If I were starting this project today, I would begin by looking for resources like these.
And finally, we would take a longer-term approach. The code isn't too bad, but it could be written to be more maintainable, and we didn't have any plan for how to iterate. We probably underestimated the scope of the problem on our first go.