In July 2020 there was a vote in Russia on making Putin effectively president for life. Sadly, it passed. But in the few days following the vote, the government made the mistake of putting the data online in a scrapeable form, so some nice people scraped it. As both turnout and support were faked, if you plot "vote for the winner" against "vote turnout" (after Klimek et al. 2012), you can see a doctored "right-corner blob" emerge from the natural blob in the middle. This doctored blob is also heavily striated, as cheaters go for round percent values, both for the turnout and for the final vote results.
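Here is roughly what such a fingerprint plot looks like in code — a minimal sketch, assuming one CSV row per polling station, with hypothetical column names:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical column names; the scraped data had one row per polling station.
df = pd.read_csv("precincts.csv")  # voters_registered, ballots_cast, votes_yes

turnout = 100 * df["ballots_cast"] / df["voters_registered"]
support = 100 * df["votes_yes"] / df["ballots_cast"]

# Klimek-style fingerprint: a 2D histogram of turnout vs. winner's share.
# Honest precincts form one roughly Gaussian blob; stuffed or faked ones smear
# toward the (100%, 100%) corner. Half-percent bins make the striation at
# round percentages visible.
plt.hist2d(turnout, support, bins=np.linspace(0, 100, 201), cmap="viridis")
plt.xlabel("Turnout, %")
plt.ylabel("Vote for the winner, %")
plt.show()
```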
Immediately after the vote, I ran these analyses and published them on Twitter. The public evidently liked the presentation, as the thread went viral, getting 1.4k retweets and 3k likes within two or three days.
I also tried some less standard analyses; for example, estimating how likely it is that each particular number reported by a polling place was produced by some official multiplying whole numbers on a calculator. If you build an index of "suspiciousness" like that, you can try to identify the regions that were most eager to engage in voter fraud (as shown in the plot on the left).
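A toy version of such an index might look like the sketch below — this is the general idea, not the exact index from the thread, and the column names are again hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("precincts.csv")  # hypothetical: region, ballots_cast, votes_yes
big = df[df["ballots_cast"] >= 500]  # small precincts hit round percents by chance

share = 100 * big["votes_yes"] / big["ballots_cast"]
# A count is "calculator-like" if it is the closest achievable count to a
# whole percentage of the ballots cast (within half a vote of an integer %).
flag = (np.abs(share - np.round(share)) * big["ballots_cast"] / 100) < 0.5

# Chance baseline: roughly 100 of the achievable counts land on an integer
# percent, so we subtract the expected rate before comparing regions.
expected = 100 / big["ballots_cast"]
suspicion = (flag - expected).groupby(big["region"]).mean()
print(suspicion.sort_values(ascending=False).head(10))
```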
More analyses, and some interesting cases by region: Twitter thread
Since 2018 I have been collaborating with Draftsim.com (a training site for people who want to get better at "Magic: the Gathering") on small data-driven projects. Using the data anonymously collected by the site, I found ways to:
describe and classify drafting strategies for every set (see the sketch after this list)
create "footprints" of each card set, allowing interesting speculations about why some sets feel more fun to play, or why certain strategies work or don't work for this set in particular
quantify and visualize changes in players' strategies over time, as they become more familiar with the set
identify cards that create public controversy and are drafted very differently by different players
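For the strategy classification, the basic move is to turn each completed draft into a vector of card-pick frequencies and cluster those vectors. A minimal sketch, with a hypothetical file name and array layout:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data shape: one row per completed draft, one column per card
# in the set, holding how many copies of that card the player picked.
picks = np.load("draft_pick_counts.npy")        # (n_drafts, n_cards)

# Normalize each draft to pick frequencies, so drafts of different lengths
# become comparable.
freqs = picks / picks.sum(axis=1, keepdims=True)

# Cluster drafts into a handful of archetypes; the cluster centers can then
# be read as "strategies" (e.g. which color pairs or themes they load on).
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(freqs)
labels = km.labels_                             # archetype id for each draft
```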
The main goal of this project was to build narratives for analytical pieces, get traction on social media, and start discussions about the data we presented. As MtG is an inherently nerdy game, we found that by injecting a bit of math and pretty visualizations into the story, we made everyone happier (our team first of all, but also the readers), which boosted user engagement on the site.
Also, as a side effect of these analyses, we wrote a bunch of cool AI agents for playing MtG and described them in a preprint.
Technical links:
Publication (preprint): Ward, H. N., Brooks, D. J., Troha, D., Mills, B., & Khakhalin, A. S. (2020). AI solutions for drafting in Magic: the Gathering. arXiv preprint arXiv:2009.00655.
Popular articles:
Intro to MtG draft analysis - how to approach MtG drafts from the data science point of view
Evolution of drafting strategies - how to notice changes in players' behavior over time
Where human players and AI players disagree - detecting and explaining controversial cards
In the fall of 2020, I was asked by a small college (not Bard) to look into their enrollment data and check whether they could noticeably reduce the chances of internal COVID outbreaks by moving a relatively small share of their courses online.
Working from the enrollment records, I reconstructed the network of expected student interactions and identified the courses that brought the most students together (not only because of their size, but also because of their place within the network). I then ran stochastic SIR models on these networks and predicted the effect that switching different courses to remote instruction would have.
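A minimal sketch of the first step, with made-up enrollment records; the betweenness measure here is one reasonable proxy for "place within the network", a stand-in rather than the exact measure I used:

```python
import networkx as nx

# Hypothetical input: one (student_id, course_id) pair per enrollment record.
enrollments = [("ann", "BIO101"), ("ann", "MATH201"), ("ben", "BIO101"),
               ("ben", "ART105"), ("eva", "MATH201"), ("eva", "ART105")]

B = nx.Graph()                          # bipartite student-course graph
for student, course in enrollments:
    B.add_edge(("s", student), ("c", course))

# Project onto students: two students interact if they share a course.
students = [n for n in B if n[0] == "s"]
G = nx.bipartite.projected_graph(B, students)

# A course matters not just through its size, but through its position in
# the network; betweenness centrality captures some of that.
centrality = nx.betweenness_centrality(B)
rank = sorted((n for n in B if n[0] == "c"),
              key=centrality.get, reverse=True)
print("courses to move online first:", [c for _, c in rank])
```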
Unfortunately, while my analysis found the optimal sequence in which courses were to be moved online, it also showed that all of these attempts would be pretty much futile, as the enrollment network in a small liberal arts college is way too connected to be effectively fragmented into subnetworks. So ultimately, the main take-home message of this analysis was that there was no silver bullet.
Note that in many networks we would actually expect the opposite result: in Barabási–Albert networks, for example, we can often reduce cases by ~20% by eliminating just one node. But not in this IRL example, unfortunately.
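To see the contrast, here is a toy comparison of outbreak sizes in a scale-free network with and without its biggest hub; the SIR parameters are arbitrary, so the exact numbers will vary:

```python
import random
import networkx as nx

def sir_outbreak(G, beta=0.05, gamma=0.1):
    """Simple discrete-time stochastic SIR on a graph; returns total cases."""
    infected = {random.choice(list(G))}
    recovered = set()
    while infected:
        new_inf = {nb for i in infected for nb in G[i]
                   if nb not in infected and nb not in recovered
                   and random.random() < beta}
        recovered |= {i for i in infected if random.random() < gamma}
        infected = (infected | new_inf) - recovered
    return len(recovered)

G = nx.barabasi_albert_graph(2000, m=2)   # scale-free network
hub = max(G, key=G.degree)                # single highest-degree node
G_cut = G.copy()
G_cut.remove_node(hub)

runs = 200
full = sum(sir_outbreak(G) for _ in range(runs)) / runs
cut = sum(sir_outbreak(G_cut) for _ in range(runs)) / runs
print(f"mean outbreak size: {full:.0f} vs {cut:.0f} after removing one hub")
```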
By the end of the SfN meeting in 2014, I was tired of attending posters, and so I decided to run a small data science project instead. For 749 posters in the last session, I recorded how many people were listening to the speaker at the random moment when I strolled by, and whether the speaker presented as a man or a woman. It seems that men and women were represented equally well, and their posters got similar attention from spectators on average. However, a disproportionately large share of "extremely hot" posters (those that gathered a crowd) were presented by men (p=0.01, Fisher's exact test). More details: here.
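For reference, the significance test here is a standard 2×2 Fisher's exact test; the counts below are made up purely for illustration (the real table is in the linked post):

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table (NOT the real counts, just an illustration):
# rows: presenter man / woman; columns: poster drew a crowd / did not.
table = [[20, 380],
         [ 5, 344]]
odds, p = fisher_exact(table)
print(f"odds ratio = {odds:.2f}, p = {p:.3f}")
```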
My highest-cited blog post so far. My original claim (as described here) was that one needs about 100 points of cumulative impact factor to successfully land a tenure-track position at a research university. I'm no longer sure that this is necessarily true; to give one example, I got a TT job with a cumulative impact factor of about 10. But then again, it was at a SLAC, which is very different from a research university.