The Brandeis-Simmons Corpus of English VP Ellipsis is a joint research project between Brandeis and Simmons Universities, led by Professor Lotus Goldberg (Brandeis University) and Associate Professor Amber Stubbs (Simmons University).
The goal of the project is twofold:
1. To investigate the identity constraints and other key questions regarding the syntactic analysis of the English VP Ellipsis construction, and by extension, of ellipsis and null anaphora more generally.
2. In support of goal 1, to create the largest single annotated corpus of English VP Ellipsis and make it available as a research resource.
Goal 2 will also support improving ellipsis resolution in NLP research, since the corpus will be available to the NLP community and formatted for use in machine learning.
The corpus creation process has two phases:
Phase 1: identifying instances of VP Ellipsis from various sources. We began with the Brown Corpus and Penn TreeBank, sources of written corpus data widely used in computational linguistics and natural language processing research, including prior VP Ellipsis corpus work. We have augmented these sources with substantial amounts of transcribed, naturally occurring spoken English, which forms the majority of the corpus's content and is a crucial trait for formal syntactic investigation. This spoken material comes from other spoken English corpora, along with additional television and radio news shows and podcasts (e.g., PBS NewsHour, Code Switch). This part of the project is nearing completion, with over 6,000 VP Ellipsis examples collected thus far, and is on track to be completed around the close of 2023. Please see the annotation guidelines for more information.
Phase 2: adding to each VP Ellipsis example a syntactically detailed annotation that goes far beyond the prior VP Ellipsis corpus work, allowing a deep examination of each example's structural traits. This is broken up into two parts. In Phase 2a, the boundaries of the antecedent clause, the antecedent VP, and the clause containing the elided VP are identified, and a paraphrase for the elided VP is created. In Phase 2b, labels for the morphological, syntactic, and basic discourse traits of each example are added. The Phase 2a and 2b annotation guidelines are currently being piloted, and samples of our results will be posted here in the coming months, along with the final version of the annotation guidelines themselves. Please watch this space for further developments!
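As a rough illustration of what a Phase 2a and 2b record might contain, the Python sketch below stores the three boundaries as character offsets, together with the paraphrase and a dictionary of labels. It is only a sketch under stated assumptions, not the corpus's actual format: the class name `VPEllipsisExample`, its field names, the helper `find_span`, the label key `licensing_auxiliary`, and the example sentence are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (start, end) character offsets into the example text, end exclusive.
Span = Tuple[int, int]


def find_span(text: str, substring: str) -> Span:
    """Return the offsets of the first occurrence of substring in text."""
    start = text.index(substring)  # raises ValueError if substring is absent
    return (start, start + len(substring))


@dataclass
class VPEllipsisExample:
    """Hypothetical record for one annotated example (not the corpus format)."""

    text: str
    # Phase 2a: boundaries and paraphrase.
    antecedent_clause: Span
    antecedent_vp: Span
    ellipsis_clause: Span
    paraphrase: str
    # Phase 2b: trait labels; the keys used here are invented for illustration.
    labels: Dict[str, str] = field(default_factory=dict)

    def span_text(self, span: Span) -> str:
        """Return the substring of the example text covered by span."""
        start, end = span
        return self.text[start:end]

    def validate(self) -> None:
        """Check that every span is in range and the VP sits inside its clause."""
        for name in ("antecedent_clause", "antecedent_vp", "ellipsis_clause"):
            start, end = getattr(self, name)
            if not 0 <= start < end <= len(self.text):
                raise ValueError(f"{name} is out of range: {(start, end)}")
        clause_start, clause_end = self.antecedent_clause
        vp_start, vp_end = self.antecedent_vp
        # Assumption: the antecedent VP is contained in the antecedent clause.
        if not (clause_start <= vp_start and vp_end <= clause_end):
            raise ValueError("antecedent VP must lie inside the antecedent clause")


# Invented example sentence, not drawn from the corpus.
text = "Ana will read the report, and Ben will too."
example = VPEllipsisExample(
    text=text,
    antecedent_clause=find_span(text, "Ana will read the report"),
    antecedent_vp=find_span(text, "read the report"),
    ellipsis_clause=find_span(text, "Ben will too"),
    paraphrase="read the report",
    labels={"licensing_auxiliary": "will"},
)
example.validate()
print(example.span_text(example.antecedent_vp))
```

Storing offsets rather than copied substrings keeps each boundary tied to the original text, which is one common way to make annotations easy to check and to convert into machine learning inputs.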
To cite this project, please use: Goldberg, Lotus and Amber Stubbs. 2020. The English VP Ellipsis Corpus. Available from: (link to site goes here)