Shane Bergsma
On Google Scholar

Presentation Materials:
    Slides for most of my conference presentations are listed below with the corresponding publication. I also provide here the presentation materials for some recent invited talks and other presentations:
  • Better Together: Large Monolingual, Bilingual and Multimodal Corpora in Natural Language Processing, 2011 talks at Cambridge University and the University of Pennsylvania (intended for an NLP audience). Slides in [pptx] [ppt] [pdf].
  • Three kinds of web data that can help computers make better sense of human language, Fall 2011 talks at York University, the University of Saskatchewan, and Stony Brook University (intended for a general Computer Science audience). Slides in [pptx] [ppt] [pdf].
  • Coreference Resolution using Web-Scale Statistics, most recently a Fall 2011 lecture at Stony Brook University (intended for an NLP audience). Slides in [pptx] [ppt] [pdf].

JHU Research Workshops:

  • Software Projects:
    1. ArcFilter: An efficient program that vastly speeds up arc-based dependency parsing. It filters arcs from the dependency graph before parsing begins. Used in our recent COLING and ACL papers. [@GoogleCode]
    2. NADA: A robust program for detecting non-referential (a.k.a. pleonastic, expletive, dummy) pronouns. It takes tokenized English sentences as input and finds occurrences of the word 'it'. For each 'it' found, the system outputs the probability that it is a referential instance rather than a non-referential pronoun. Described in our DAARC 2011 paper. [@GoogleCode]
    3. Carmen: A Twitter Geolocation System. "Given a tweet, Carmen will return Location objects that represent a physical location. Carmen uses both coordinates and other information in a tweet to make geolocation decisions. It's not perfect, but this greatly increases the number of geolocated tweets over what Twitter provides." Described in our HIAI paper. [@GitHub]
    4. ngramtools: Tools for searching Google N-grams and acquiring lexical knowledge from them. [@GoogleCode]
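
To illustrate the input and output described for NADA above (tokenized English sentences in, one probability per occurrence of 'it' out), here is a minimal Python sketch. It is not NADA's code or its model: the function names and the cue-word scorer are hypothetical placeholders chosen only to show the shape of the task.

```python
from typing import Callable, List, Tuple


def find_it_tokens(tokens: List[str]) -> List[int]:
    """Return the index of every token equal to 'it' (case-insensitive)."""
    return [i for i, tok in enumerate(tokens) if tok.lower() == "it"]


def toy_referential_probability(tokens: List[str], index: int) -> float:
    """Hypothetical placeholder scorer, NOT NADA's model.

    Returns a low probability of being referential when 'it' is directly
    followed by a small set of cue words such as 'seems' or 'rains'.
    """
    pleonastic_cues = {"rains", "snows", "seems", "appears"}
    next_token = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    return 0.1 if next_token in pleonastic_cues else 0.9


def score_sentence(
    sentence: str,
    scorer: Callable[[List[str], int], float] = toy_referential_probability,
) -> List[Tuple[int, float]]:
    """Split a pre-tokenized sentence on whitespace and score each 'it'."""
    tokens = sentence.split()
    return [(i, scorer(tokens, i)) for i in find_it_tokens(tokens)]


if __name__ == "__main__":
    for position, prob in score_sentence("It seems that she bought it ."):
        print(f"token {position}: P(referential) = {prob}")
```

The scorer is passed in as a parameter so that a real classifier could replace the placeholder without changing the token-finding step.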

  • Generally Useful NLP Data:
    1. Noun Gender and Number Data for Coreference Resolution. My most widely used data, and one of the standard resources in the Closed Task for the CoNLL 2011 Shared Task on Modeling Unrestricted Coreference in OntoNotes. Your coreference system should probably make use of it too! [GenderData]
    2. First name, last name, and location clusters from Twitter: Large-scale data mined from Twitter communication patterns. [Clusters]
    3. Distributional Clustering of Phrases: A clustering of a huge number of phrases from Google N-grams. [Clusters]

  • Training and Evaluation Code/Data:
    1. *Manually-Annotated Data for Language Identification in Twitter, along with a Python-based language-ID system. [Tweets]
    2. *Manually-Segmented Search Engine Queries and Feature Data. This query data has become a standard evaluation set for Information Retrieval research. [Queries]
    3. Annotated and processed ACL articles used in our work on Stylometric Analysis of Scientific Articles. [Labeled ACL Papers]
    4. Evaluation code and data for Learning Bilingual Lexicons from the visual similarity of Web Images. [Visual Lexicon Materials]
    5. Evaluation code and data for our Coordination Disambiguation project. [Coordination Materials]
    6. Evaluation code and data for our Visual Selectional Preference project. [Visual Selectional Preference Materials]
    7. Evaluation data for our Robust Supervised Classifiers project. [Robust Data]
    8. It-Bank: An online repository of labelled instances of the pronoun "it". [It-Bank]
    9. American National Corpus articles with Annotated Anaphora Resolutions. [Annotated Anaphora Data]
    10. Evaluation data used in our Alignment-Based Discriminative String Similarity project. [Cognates]