Topics and Submissions


SANCL Background
 
More and more research in NLP is focusing on non-canonical data, creating the need for more accurate and more robust methods for POS assignment and syntactic parsing, since those processing steps are part of most NLP pipelines, be it in the area of machine translation, sentiment analysis or information extraction. Most applications would benefit from having access to higher quality syntactic representations. However, it is often not clear what a good representation should look like, or even what the appropriate unit of analysis should be. The nature of the application determines the syntactic analysis to some extent. For example, those interested in second language acquisition may require a syntactic analysis of an L2 utterance that is qualitatively different from an analysis in which a grammatical error can simply be ignored.
 
Non-canonical Language
 
Our working definition of "non-canonical" is any type of language that falls under one or more of the following overlapping categories:
  1. Transcribed spoken language (from spontaneous conversation to scripted speeches)
  2. Learner language (in all forms including essays, answers to language exercises, transcribed dialogue)
  3. The language of social media (blogs and blog comments, forum posts, microblogs, consumer reviews)
  4. Computer-mediated communication in general (email, SMS, chat)
  5. All forms of microtext including microblogs, SMS and notes (e.g. clinical notes)
  6. Historical texts

Topics

Papers submitted to the workshop should address the following topics:

- What is the best strategy for parsing non-canonical language? Should we make conventional tools more robust? Should we explore semi-supervised domain adaptation techniques? Should we develop individual tools for each non-canonical type?
- Can insights gained from parsing one type of non-canonical text help in parsing another?
- What are the challenges of handling the often heterogeneous nature of the data (e.g. code-switching)?
- What role does pre-processing play in the parsing of non-canonical data?
- To what extent is it necessary or desirable to perform full parsing for some kinds of non-canonical text?
- From a theoretical perspective, what are the appropriate analyses for non-canonical structures?
- How should new linguistic forms emerging from social media be analysed, e.g. the use of hashtags on Twitter?
- What is the optimal unit of analysis?
- For non-sentential units (frequent in spoken language) and especially for elliptical utterances: what kind of information is necessary for a meaningful analysis? Depending on the application, categories like "NP" or "PP" might not be sufficient.
- Can we use annotation schemes and tag sets developed for standard written text? Should we adapt existing schemes, or start from scratch?
 
The SANCL workshop welcomes both theoretical and practical contributions for any grammatical framework, any parsing approach and any language.