Measuring Corruption & Forecasting Social Unrest:
Recent Applications of Natural Language Processing Techniques and Crisis Prediction Modeling at the IMF

Chris Redl, Sandile Hlatshwayo @ International Monetary Fund

Abstract:

Corruption is macro-relevant for many countries, but is often hidden, making measurement of it—and its effects—inherently difficult. The first paper leverages news media coverage of corruption by constructing the first big data, cross-country news flow indices of corruption (NIC) and anti-corruption (anti-NIC) using more than 665 million news articles. These indices correlate well with existing measures of corruption but offer additional richness in their time-series variation. On average, NIC shocks lower real per capita GDP growth by 3 percentage points over a two-year period, illustrating persistence in the effect of such shocks. Conversely, there is suggestive evidence that anti-NIC efforts appear to have a sustained positive macro impact only when paired with meaningful institutional strengthening. In separate work, we produce a social unrest risk index for 125 countries covering a period of 1996 to 2020. The risk of social unrest is based on the probability of unrest in the following year derived from a machine learning model drawing on over 340 indicators covering a wide range of macro-financial, socioeconomic, development and political variables. The prediction model correctly forecasts unrest in the following year approximately two-thirds of the time. Shapley values indicate that the key drivers of the predictions include high levels of unrest, food price inflation and mobile phone penetration, which accord with previous findings in the literature.

Bios:

Chris Redl is an economist in the Asia Pacific Division at the International Monetary Fund where he contributes to the IMF’s Regional Economic Outlook for Asia and the Pacific. His research has focused on exchange rates, the measurement and impact of economic and political uncertainty, and economic forecasting using non-traditional data such as text. Previously he worked in the Strategy, Review, and Policy department covering emergent macroeconomic risks relevant to the Fund’s member countries. Prior to joining the IMF, he worked at the Bank of England and Her Majesty’s Revenue & Customs in the United Kingdom, and at the University of the Witwatersrand in South Africa. He is a research fellow for the Centre for Data Analytics for Finance and Macroeconomics at King’s College London and was previously a visiting research fellow at the South African Reserve Bank. He holds a Ph.D. in Economics from Queen Mary, University of London, a master’s degree in Economics from the London School of Economics and the University of the Witwatersrand, and a bachelor’s degree in Economics, Finance and Philosophy from the University of the Witwatersrand.

Sandile Hlatshwayo is an economist in the Strategy, Review, and Policy department at the International Monetary Fund where she helps evaluate macro-relevant risks across the Fund’s 190 member countries through crisis prediction modeling, text-based analytics, and strategic foresight activities (e.g., policy-gaming). Her primary research interest is quantifying the domestic and international consequences of policy uncertainty. She has also previously engaged in country-specific policy work on Madagascar and Fiji. Outside of her professional obligations, she mentors; sits on the boards of Black Professionals in International Affairs and the Sadie Collective; is a Council on Foreign Relations term member; and serves on the American Economic Association’s Committee on the Status of LGBTQ+ Individuals in the Economics Profession. Prior to her graduate studies, she worked in the private sector at Procter & Gamble in South Africa. She holds a Ph.D. in Economics from UC Berkeley, a master’s degree in policy studies from Stanford University, and a bachelor’s degree in Economics and Political Science from Spelman College.

Summary:

Forecasting social unrest
- ML-based forecasting model of unrest
- Clarify which socioeconomic variables are useful for this
- We don’t currently have a solid theory of unrest
- Output: a risk index
- Base dataset:
  - Database of unrest events
  - They classified events by category: government, democratic, elections, global issues, religious, etc.
  - Collect news text that includes counts of various unrest-related words
  - Various datasets of the economy, natural disasters, poverty, etc.
  - Observation: on average GDP drops by ~1% after unrest events (the analysis controls for endogeneity as far as feasible, using unrest in neighboring countries as an instrumental variable)
- Prior literature lists a set of possible drivers: food prices, inequality, competition between elites, social media, weak growth
- Modeling
  - Target variable: unrest 1 year in the future (binary)
  - Cross-validation: train on time series prefix, predict for next year (avoids contamination of results from future events)
  - They tried various models
    - Neural nets and linear models didn’t work well
    - Best options were tree-based models
    - Achieved 66% AUC (significantly better than chance: 50%)
  - Talk goes over several case studies: UK, US (works better), Egypt, Thailand (works poorly)
  - Key drivers:
    - Unrest in previous year (autoregressive)
    - Inflation
    - Contagion from neighboring countries
    - Use of digital media
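
The cross-validation scheme in the modeling notes above (train on a time-series prefix, predict the next year) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `min_train_years` parameter, and the toy panel are assumptions made for the example. It also includes a small pairwise AUC function, so that a figure such as 66% can be read as the chance that a randomly chosen unrest case is scored above a randomly chosen non-unrest case.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_splits(
    years: Sequence[int], min_train_years: int = 3
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx): train on every year before the test year.

    min_train_years is a hypothetical knob (not from the talk) that sets how
    many distinct years must be available before the first test fold.
    """
    unique_years = sorted(set(years))
    for test_year in unique_years[min_train_years:]:
        train_idx = [i for i, y in enumerate(years) if y < test_year]
        test_idx = [i for i, y in enumerate(years) if y == test_year]
        yield train_idx, test_idx


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy country-year panel with made-up values:
    # (year, unrest_this_year, unrest_next_year).
    panel = [
        (2015, 0, 0), (2015, 1, 1),
        (2016, 0, 1), (2016, 1, 1),
        (2017, 1, 1), (2017, 0, 0),
        (2018, 1, 1), (2018, 0, 0),
    ]
    years = [row[0] for row in panel]
    for train_idx, test_idx in expanding_window_splits(years, min_train_years=2):
        # A real model would be fit on train_idx here. This persistence
        # baseline (score = unrest this year) needs no fitting.
        labels = [panel[i][2] for i in test_idx]
        scores = [float(panel[i][1]) for i in test_idx]
        print(years[test_idx[0]], len(train_idx), auc(labels, scores))
```

Because each fold only sees years strictly before the test year, no information from later events can leak into the fitted model, which is the contamination the notes refer to.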

Modeling corruption in countries
- Paper: The Measurement and Macro-Relevance of Corruption: A Big Data Approach
- Model generations
  - 1st: Perception-based (corruption indexes, control of corruption, etc.)
  - 2nd: Victimization & Indicator-based (surveys of the bribes people paid and their actual experiences)
  - 3rd: Big Data (e.g. ipaidabribe.com, procurement analyses, news flow index)
- Big data approaches are more neutral and can be collected with higher frequency
- Observation from other work: large shocks tend to shift beliefs about society/economy (e.g. financial crash on savings, COVID on work from home)
- Created NIC:
  - measures news flow about corruption, not corruption directly
  - 30 economies 1995-2017
  - 665+ million articles from Dow Jones news aggregator
    - Core:
      - Article must mention the country, corruption, and the public sector, come from a major source, and run to more than 99 words
      - Excludes anti-corruption articles
      - Excludes own-country sources (reporting on one's own country may be biased)
    - Extensions:
      - Measures of lobbying (e.g. legal corruption)
      - Measures of anti-corruption
  - Performed human audits of whether the matching rules catch appropriate articles
    - Some bad matches: scammers who pretended to be government officials, or bribery of corporate workers
    - Most article matches are good though
  - Advantages:
    - No reliance on official sources
    - High frequency
    - Saliency for economic actors
    - No over-reliance on local experts
  - Limitations:
    - Depends on press, which may not be free (constrained by government or corrupt themselves)
      - This was the motivation for removing own-country sources (own-country sources are also too sensitive to small stories)
    - Private vs public corruption
    - Countries with systemic corruption may not get that many reports because they’re not “new”
  - Observation: high correlation between NIC and other measures of corruption
  - Identified shocks in NIC time series, predicted impact of NIC shocks on GDP growth
    - Observed a drop of ~3 percentage points in real per capita GDP growth over two years (per the abstract)
    - Much larger impact in high-corruption countries (there is less trust that corruption spikes will be dealt with)
  - Computed impact of anti-corruption news on GDP growth: minimal impact
  - But the interaction of anti-corruption news with institutional strengthening shows a strong positive impact on GDP growth (described in the abstract as suggestive evidence)
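
The core matching rule for the NIC described above can be sketched as a simple keyword filter. This is an illustrative sketch only: the keyword lists, the `Article` fields, and the function names are hypothetical, since the notes do not give the actual search terms or data schema, and the real index may also normalize counts (for example by total news volume), which is not shown here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Sequence

# Hypothetical keyword lists for illustration; the real search terms are
# not given in these notes.
CORRUPTION_TERMS = ("corruption", "bribe", "kickback", "embezzle")
PUBLIC_SECTOR_TERMS = ("government", "minister", "ministry", "official")
ANTI_CORRUPTION_TERMS = ("anti-corruption", "anticorruption")
MIN_WORDS = 100  # the notes require more than 99 words


@dataclass(frozen=True)
class Article:
    text: str
    year: int
    source_country: str    # where the outlet is based
    is_major_source: bool  # assumed to be flagged upstream


def contains_any(text: str, terms: Sequence[str]) -> bool:
    """True if any term occurs as a substring of the (lowercased) text."""
    return any(term in text for term in terms)


def matches_core_rule(article: Article, country: str) -> bool:
    """Apply the core rule: country + corruption + public sector, major
    foreign source, long enough, and no anti-corruption mention."""
    text = article.text.lower()
    return (
        article.is_major_source
        and article.source_country != country       # drop own-country sources
        and len(text.split()) >= MIN_WORDS
        and country.lower() in text
        and contains_any(text, CORRUPTION_TERMS)
        and contains_any(text, PUBLIC_SECTOR_TERMS)
        and not contains_any(text, ANTI_CORRUPTION_TERMS)
    )


def yearly_match_counts(articles: Iterable[Article], country: str) -> Counter:
    """Count matching articles per year: a raw news-flow series for one country."""
    return Counter(a.year for a in articles if matches_core_rule(a, country))
```

Plain substring matching like this also produces the kind of false positives the human audits found (for example, a scam story that mentions a government), which is why auditing the matches matters.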
