Research & funding

My preferred theoretical framework is usage-based/experience-based linguistics. My research interests all broadly fall within the remit of variationist linguistics and variation studies, including their interfaces with typology, geolinguistics, and psycholinguistics. I view linguistic variation as a window into the hidden structure of human language and the nature of linguistic knowledge. My research interests specifically include:
  • variation studies (synchronic & diachronic)
  • probabilistic grammar
  • sociolinguistics and register analysis
  • language complexity
  • geolinguistics, dialectology & dialectometry, and dialect typology
  • learner language

Recent representative publications

  • Szmrecsanyi, Benedikt, Jason Grafmiller & Laura Rosseel (2019). "Variation-Based Distance and Similarity Modeling: a case study in World Englishes". Frontiers in Artificial Intelligence 2:23.
    DOI: 10.3389/frai.2019.00023 | open access
  • Ehret, Katharina & Benedikt Szmrecsanyi (2019). "Compressing learner language: an information-theoretic measure of complexity in SLA production data". Second Language Research 35(1): 23-45.
    DOI: 10.1177/0267658316669559 | manuscript 
  • Szmrecsanyi, Benedikt, Jason Grafmiller, Joan Bresnan, Anette Rosenbach, Sali Tagliamonte & Simon Todd (2017). "Spoken syntax in a comparative perspective: the dative and genitive alternation in varieties of English". Glossa: a journal of general linguistics 2(1): 86.
    DOI: 10.5334/gjgl.310 | open access
  • Szmrecsanyi, Benedikt (2016). "An analytic-synthetic spiral in the history of English". In: Elly van Gelderen (ed.), Cyclical Change Continued. Amsterdam: Benjamins, 93-112.
    DOI: 10.1075/la.227.04szm | uncorrected page proofs

Current funded projects

    • Exploring probabilistic grammar(s) in varieties of English around the world
      Applicant and PI
      Funded by a Type II Odysseus grant awarded by the Research Foundation Flanders (FWO) (grant # G0C5913N, budget: €856,260)
    The project is situated at the crossroads of research on English as a World Language, usage-based theoretical linguistics, variationist linguistics, and cognitive sociolinguistics. It specifically marries the spirit of the Probabilistic Grammar framework (which posits that grammatical knowledge is experience-based and partially probabilistic) to research along the lines of the "English World-Wide" paradigm (which is concerned with the dialectology and sociolinguistics of post-colonial English-speaking communities around the world). The overarching objective is to understand the lectal plasticity of probabilistic knowledge of English grammar among language users with diverse regional and cultural backgrounds.
    • The register-specificity of probabilistic grammatical knowledge in English and Dutch
      Applicant and PI, with Jason Grafmiller and Freek Van de Velde (Co-PIs)
      Funded by the Research Foundation Flanders (FWO) (grant # G0D4618N, budget: €229,000)
    Probabilistic grammars regulate how we choose between different ways of saying the same thing. For example, in English people can say either "Tom sent Mary a letter" or "Tom sent a letter to Mary". Both syntactic variants have roughly the same meaning, and we know that variant choice is a function of precisely quantifiable effects of probabilistic factors such as the length of the theme or the pronominality of the recipient. The project asks whether language users have different probabilistic grammars for different types of speech situations – in other words, do our linguistic choice-making processes differ depending on whether we are, for example, engaging in informal conversation or writing blog entries? The project will tackle this question empirically by investigating the register-specificity of grammatical variation in English and Dutch. The contrastive variation analysis will rely on both corpus evidence (i.e. observation) and rating task experiments.

    • Nephological Semantics: using token clouds for meaning detection in variationist linguistics
      Co-PI with Dirk Geeraerts, Stefania Marzo & Dirk Speelman
      Funded by a C1 grant awarded by the KU Leuven Research Council (grant # 3H150305, budget: €1,271,200)
    The increasing importance of corpus data in linguistics creates a need for appropriate methods for retrieving semantic information from corpora. In this project, existing computational methods of distributional corpus semantics are further developed in the form of a meaning detection approach based on token clouds, i.e. clusters of distributionally similar attestations of words or expressions in a multidimensional vector space. The first phase of the project has a methodological orientation, focusing on fine-tuning such a 'nephological' method for detecting linguistic meanings in corpus data. In the second phase of the project, the method is put to use in two descriptive research lines: lectometrical research into the relationship between language varieties, and variationist grammar research.

    • North and South, bottom to top: using big data to model syntactic variation in Belgian and Netherlandic Dutch
      Co-PI with Dirk Speelman, Stefan Grondelaers, and Antal van den Bosch
      Funded by a "Letteren, Nijmegen en Leuven" (LN&L) grant (budget: approx. €100,000)
    While Belgians and the Dutch are well aware that they use different words and that their pronunciation diverges, they are mostly oblivious to the grammatical discrepancies between Belgian and Netherlandic Dutch. Few Belgians, for instance, will realize that the preposition voor in Jan maakte (voor) haar een boterham ('Jan made her a sandwich') is optional for them, whereas it is indispensable for almost all the Dutch. Why are there such pronounced syntactic differences between two varieties (in a comparatively small language area) that did not begin to diverge before the 16th century? And where do these differences come from? In order to answer these questions, we draw on large subtitle and newspaper corpora, and marshal machine translation, machine learning, and automated semantic classification technologies to access the syntactic motor, or motors, of Dutch.