SHAPE Corpus workshops 2017

Day 1 Creating corpora from early language sources (tapes, manuscripts, published works)

Day 2 What we can only learn from corpora?


We are not organising accommodation, but we have an arrangement with the Travel Inn Best Western, which has offered the following rates (tell them you are attending the CoE event).

$155.00 Room only

$165.00 Room and Breakfast (1 person)

$175.00 Room and Breakfast (2 Person Share)

(Including wifi and carparking)

2nd May 2017

Creating corpora from early language sources (tapes, manuscripts, published works)

Venue: William Macmahon Ball Theatre, Old Arts, University of Melbourne
(organised by Jenny Green and Nick Thieberger)

Description: This workshop addresses processes involved in creating usable forms of text from early sources. Examples include: making structured dictionaries out of paper-based books; conversion of digital files into usable formats; textual versions of early manuscript records; useful ways to organise and transcribe dynamic media (audio, video) including crowdsourcing. We will also discuss rights and permissions that need to be considered when using old sources. 
Note that there is a $30 registration fee for this workshop.

Program (see the list of abstracts here)

0900, May 2nd, 2017


0915 – 0945

Historical sources for language revival in Victoria
Kris Eira, Victorian Aboriginal Corporation for Languages


Setting up the Howitt-Fison Archival Corpus
Patrick McConvell (ANU/WSU) & Rachel Hendery (WSU)


The dictionary of the Tahitian Academy: from the Word file to the digital database

Jacques Vernaudon, Hugues Talfer & Nick Thieberger
University of French Polynesia & University of Melbourne



Nyungar song in early written sources
Dr Clint Bracknell
The University of Sydney


Ngarda-ngarli thabi: building a database for a regional Aboriginal public song tradition of the Pilbara 

Andrew Dowding, Sally Treloyn, Reuben Brown, Jared Kuvent, and Nick Thieberger


Ken’s Kaytetye

Myfany Turpin, Ben Foley, Nay San and Amy Parncutt



Making the signs fit: From archive to ELAN and beyond

Jennifer Green

The University of Melbourne


Automatic alignment of mis-matched video and audio

Sasha Wilmoth (UQ/Appen), Ola Olsson (UQ), Felicity Meakins (UQ)


Engaging a crowd for a common goal … Aboriginal and Torres Strait Islander language transcription activities @ SLNSW

Melissa Jackson, Indigenous Services Branch, SLNSW




You must be in tune with the times and prepared to break with tradition: lessons learnt in working with the Ngarrindjeri corpora

Mary-Anne Gale

Research Fellow, University of Adelaide


The Living Archive of Aboriginal Languages

Cathy Bow


Compilations and Copyright

Thomas Allen





See the program for day 2 below

3rd May 2017

What we can only learn from corpora?

Venue: Sidney Myer Asia Centre, Room 106 (Yasuko Hiraoka Myer Room), University of Melbourne
(organised by Stefan Schnell and Nick Thieberger)

Description: For this workshop, we seek presentations of research on specific linguistic topics for which corpus-linguistic or corpus-based methods make a particular difference, such that we learn things we could not have learned otherwise, and which thereby contribute to grammar writing, language typology, and/or linguistic theory. Possible corpus-linguistic methodological aspects include:
  • systematic retrievability of linguistic structures in context
  • quantifiability of linguistic structures
  • contributions through specific innovative techniques of corpus queries and/or corpus annotation
  • taking into account text varieties (genres, registers, styles)
  • use of stimuli in text elicitation within experimental setups (SocCog Family problem stories, Pear Film or Frog picture book re-tellings, etc)
  • diachronic corpora and language change
  • research uses of corpus data by scholars other than the compilers
  • experiences in replicability of linguistic research
  • representativeness of corpus data for linguistic systems
  • ...

Draft program / schedule:



9 – 9.30


9.30 – 11

Intoning Information: A Prosodic Corpus of Mawng
Janet Fletcher, Ruth Singer, Hywel Stoakes (Melbourne)

The prosody of a sentence (e.g. stress, intonation, timing) can communicate different kinds of meaning, yet it remains one of the more intractable features of spoken communication in Australian languages. Constructing a corpus of spontaneous, elicited and controlled Laboratory Phonology-style speech allows us to investigate different layers of prosodic form and function in the Australian Indigenous language Mawng.

A corpus-based approach to vocalic contrasts in Kaytetye
Nay San (ANU)

Kaytetye is an Arandic language which has been described as having only two phonemic vowels, /ɐ, ə/. The /ə/ vowel has been described as ‘featureless’, with its realisation determined by surrounding consonants, while /ɐ/ is consistently realised as a low vowel [ɐ]. These descriptions of Kaytetye vowels, however, relied on impressionistic observations, without acoustic accounts of the range of realisations of /ə/. This talk presents some new results from a quantitative analysis of vowel variation in Kaytetye, and a technical illustration of the Kaytetye Phonological Corpus (KPHON).

11 – 11.30

Morning tea

11.30 – 1

Corpus-based insights into language development across genres
Barb Kelly (Melbourne)

Automatic methods for classification of verb classes in Abui
Frantisek Kratochvil (Singapore)

(download abstract pdf)

1 – 2


2 – 3.30

What we can learn about social cognition in language from SCOPIC (the Social Cognition Parallax Interview Corpus)
Nick Evans & Danielle Barth

From free pronoun to agreement marker: what we can learn from discourse data across genres and languages
Stefan Schnell (Melbourne)

In this talk, I discuss the role of genre variation in the corpus-based investigation into the typologically ubiquitous development of person agreement systems from free pronoun systems. I present initial findings from investigations of a multi-genre corpus from Vera’a concerning the conditions of grammaticalisation of subject and object agreement, and findings from a cross-corpus and cross-genre investigation into possible motivations for ergative as opposed to accusative (or other) alignment in agreement systems. I will show that both naturalistic and stimuli-controlled text data have their merits, but that a focus on only one type of corpus data may yield severe drawbacks.

3.30 – 4

Afternoon tea

4 – 5.30

Closing discussion

6pm Book launch: "Something about Emus: Bininj stories from Western Arnhem Land" edited by Murray Garde
7pm Dinner at University Cafe ($47, set menu with a cash bar)

For presenters: Time slots for each talk are 45 minutes, including Q&A time. The idea is to outline a specific linguistic research question, and then explain how your corpus study addresses the question in novel ways and yields findings and insights that would not have been attainable without this dedicated corpus-linguistic approach. The additional time is thus for elaborating on matters of corpus compilation, processing and analysis.

To register for either or both workshops, please go to the site below before March 31st. Numbers are limited, so we will be capping attendance. Please let us know if you have registered but are unable to attend, so that we can offer your place to the next person on the list.

Organised by the Melbourne node of the Centre of Excellence for the Dynamics of Language