Wisdom of the Crowds or Ignorance of the Masses? A data-driven guide to WallStreetBets
(with Dragos Gorduza, William Wildi, Xiaowen Dong, Stefan Zohren)
This paper presents a data-driven guide to the WSB forum, targeted at academics and practitioners. It uncovers text-based and user-pattern-based asset clusters within the forum by applying large language models and network techniques to WSB submission and comment data. Alongside the analysis, the paper releases a dataset of hand-annotated due diligence posts and sentiment-labelled posts, as well as an interactive dashboard, to promote further exploration and research.
Accepted at the Journal of Portfolio Management
Social Contagion and Asset Prices: Reddit's Self-Organised Bull Runs
(with Julian Winkler)
Can unstructured text data from social media help explain the drivers of large asset price fluctuations? This paper investigates how social forces affect asset prices by using machine learning tools to extract the beliefs and positions of 'hype' traders active on Reddit's WallStreetBets (WSB) forum.
Awards: Second prize, Rebuilding Macroeconomics competition
In the media: LSE Business Review, Financial Times.
From Micro to Macro: understanding the social dynamics behind political and economic change
(with John Pougué-Biyong)
This paper studies the evolution of the Brexit discussion over time. We begin by outlining and testing a novel signed, temporal clustering algorithm on static and temporal synthetically generated data. The algorithm highlights periods of social turmoil (July-September 2019) and of relative stability (May-August 2020). Our proposed metric for community overlap over time correlates with unemployment and the GBP/USD exchange rate, linking data on social discussions to the macroeconomy.
Winning paper of the Complexity in Social Macroeconomics Research Prize (2022), published in the prize's special issue
DEBAGREEMENT: a comment-reply dataset for (dis)agreement detection in online debates
(with John Pougué-Biyong)
In this paper, we introduce DEBAGREEMENT, a dataset of 42,894 comment-reply pairs from the r/BlackLivesMatter, r/Brexit, r/climate, r/democrats, and r/Republican forums, annotated with agree, neutral, or disagree labels. We evaluate the performance of state-of-the-art language models on a (dis)agreement detection task, and investigate the use of available contextual information (graph, authorship, and temporal information).
Published at NeurIPS 2021