Workshop on Scalability in Natural Language Processing

This workshop, held with RANLP 2013, aims to introduce contemporary work on natural language processing at a large scale, to discuss novel methods for it, and to explore how the resulting technology and methods can be reused in applications both on the Web and in the physical world.

What is scalable NLP?
For a processing approach to be scalable, it should be able to take on large volumes of data, work through them at high speed, and adapt smoothly to changes in these needs. We discuss this in the context of NLP, with particular focus on the core tasks of resource creation, discourse processing, and evaluation.

Why is this workshop timely?
Now is a particularly important time to develop scalable methods in our field. Big data is here, and the benefits of processing it effectively remain to be harvested by the pioneers. Huge datasets are becoming available: Google Books contains 155 billion tokens, over which only shallow surveys have been conducted; the new Common Crawl web corpus contains over 60 terabytes of text and metadata. But size is not the only driver for scalable methods: the rapid creation of text content we see every day presents masses of data we are not yet equipped to handle. For example, Twitter alone is responsible for 500 million microtexts every day; the publicly-visible web holds a part of the 2 million blog documents we create every 24 hours.

Why is this topic important?
Big text data is not only becoming prolific; demand for it is also high. The fast, un-curated nature of microtext has been shown to be of value in stock valuation by multiple researchers. User location and movement analysis enables powerful search and analysis modes, such as computational journalism and powerful personalisation. Sentiment detection informs corporations, governance and political activities. Media monitoring requires extracting and co-referring entities and events from thousands of outlets in real time. And finally, the emerging field of deep learning places but one core demand in all its guises: large amounts of data. Together, these applications create a demand for NLP that can be done quickly and broadly.

There is more demand than ever for scalable natural language processing. Many organisations are interested in the potential results as big data becomes better defined and data-intensive approaches to computational linguistics reach production-level performance. Enormous quantities of data, from user input to news archives, are being mined using more powerful and computationally demanding techniques. 

Newly introduced data-intensive approaches to computational linguistics continue to thrive on input volume; we need scalable technology to handle the next order of magnitude in corpus sizes and, given the nature of language, to continue data-intensive advances in our field.


With regard to Scalable NLP, we aim to encourage discussion regarding three key areas of natural language processing: resource creation, processing of discourse, and evaluation. Topics of interest include:

  • General scalability issues
    • Application approaches
    • Performance limits
  • Flexible resource creation
    • Parallelising annotation
    • Handling huge corpora
    • Crowdsourcing for corpus creation
    • Decomposing resource creation tasks
    • Rapid or realtime annotation quality assessment
  • Scalable processing
    • Running NLP in the cloud
    • Privacy issues
    • NLP application parallelisation
    • NLP application optimisation 
    • Scalable machine learning for NLP
    • High performance computing for NLP
    • Rapid evaluation
  • On-line learning for NLP
    • Reinforcement learning
    • Iterative and ensemble learning
    • Hypothesis generation

In addition to the invited talk and presentations, we intend to include a 30-minute hands-on demonstration slot with participants doing NLP in the cloud using the AnnoMarket platform, possibly including social media processing using GATE TwitIE (supported and funded by the organisers).

Draft timetable:
0930-1015: Invited talk
1015-1030: Discussion
1030-1100: Coffee break
1100-1300: Paper presentations and discussions
1300-1400: Lunch
1400-1430: Demonstration
1430-1600: Paper presentations and discussions
1600-1630: Coffee break
1630-1730: Panel and discussion: Scalable NLP and bridging the AI gap

The ScaNLP workshop is partially supported by GATE, the EU FP7 projects TrendMiner and AnnoMarket, and the CHIST-ERA uComp project.