Language Documentation Tools Summit

 http://www.dynamicsoflanguage.edu.au/

Topics for discussion, including but not limited to:

Big Questions

• What are the most pressing technological needs in making better records of the world’s small languages?

• What efforts are being made by current researchers to address these needs?

• How can these efforts be coordinated to maximise the possibility of interoperability?

• What obstacles to more efficient work practices could be overcome by a targeted effort of programming over the next few years?

• What emerging tools or methods can we look to and invest in?

• Without singling out any one project, there have in the past been large infrastructure projects that developed guidelines and frameworks, sometimes reaching the level of functioning systems, but that ended up without content or users.

• Why are there so few digital repositories for all the material being created by documentation projects?

• Can we identify the successful systems/tools we use, and why they are successful? (e.g. OLAC, Toolbox, ELAN)

Theme 1: Archiving, Discovery, (re)Use (theme leader: Linda Barwick)

• How can we build on the foundation provided by OLAC to maximise discoverability of existing material? (A small harvesting sketch follows this list.)

• How can archived language resources be made more useful in terms of citation, persistent identification, ease of access, and the development of ‘landing pages’ that describe the collections?

• How can we extend the number and reach of archives to include more records, especially at-risk legacy records?

• Archiving software, visualisation of language collections
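
By way of illustration for the OLAC item above, here is a minimal sketch of harvesting OLAC metadata over OAI-PMH, the protocol the OLAC aggregator builds on. The endpoint URL is a placeholder, and the code only pulls out titles and language codes; it is not a description of any particular archive's interface.

```python
# A sketch only: harvest OLAC records from an archive's OAI-PMH endpoint.
# The endpoint URL is a placeholder; substitute a real archive's base URL.
# A full harvest would also need to follow OAI-PMH resumptionToken paging.
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # hypothetical OAI-PMH endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
OLAC_CODE = "{http://www.language-archives.org/OLAC/1.1/}code"

def list_olac_records(base_url=BASE_URL):
    """Yield (identifier, titles, language codes) for each harvested record."""
    resp = requests.get(base_url,
                        params={"verb": "ListRecords", "metadataPrefix": "olac"})
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(".//oai:identifier", default="", namespaces=NS)
        titles = [t.text for t in record.iterfind(".//dc:title", NS)]
        # Language subjects carry an olac:code attribute; other subjects do not.
        codes = [s.get(OLAC_CODE) for s in record.iterfind(".//dc:subject", NS)
                 if s.get(OLAC_CODE)]
        yield identifier, titles, codes

# Example (against a real endpoint):
# for identifier, titles, codes in list_olac_records():
#     print(identifier, titles, codes)
```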

Theme 2: Workflows, Interoperability (theme leader: Sasha Arkhipov)

• What is the range of workflows (from recording through to the archive) used in language documentation (LD) projects, and how can they be improved?

• Workflow blockages: how much does the lack of interoperability of our tools prevent the development of well-constructed corpora? (Problems include assigning metadata to items; knowing what has been transcribed, annotated, and interlinearised; and moving from complex multi-tier transcription to interlinearisation while losing part of the transcription along the way; see the tier-export sketch after this list.)

• Interoperability and the outputs of LD (standards for all kinds of material created by LD)

• Standard formats for complex annotation/IGT

• Metadata entry tools to help organise collections and prepare them for archiving
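
To make the interlinearisation blockage above concrete, here is a minimal sketch (assuming the pympi library and hypothetical tier names) of exporting two ELAN tiers to Toolbox-style backslash markers while keeping the time alignment. Any real project would need to map its own tier structure and handle reference tiers.

```python
# A sketch only: export two ELAN tiers to Toolbox-style backslash markers.
# Assumes the pympi-ling package (imported as pympi) and hypothetical tier
# names 'transcription' and 'translation'; adapt to your own tier structure.
import pympi

def eaf_to_toolbox(eaf_path, out_path,
                   tx_tier="transcription", ft_tier="translation"):
    eaf = pympi.Elan.Eaf(eaf_path)
    # Each annotation comes back as a tuple starting (begin_ms, end_ms, value).
    tx = sorted(eaf.get_annotation_data_for_tier(tx_tier))
    ft = sorted(eaf.get_annotation_data_for_tier(ft_tier))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\\_sh v3.0  400  Text\n\n")  # minimal Toolbox-style header
        for n, ann in enumerate(tx, start=1):
            start, end, text = ann[0], ann[1], ann[2]
            out.write(f"\\ref {n:04d}\n")
            out.write(f"\\tx {text}\n")
            # Attach any translation whose time span overlaps this annotation.
            overlaps = [a[2] for a in ft if a[0] < end and a[1] > start]
            if overlaps:
                out.write(f"\\ft {overlaps[0]}\n")
            # \ELANBegin / \ELANEnd keep the time alignment recoverable.
            out.write(f"\\ELANBegin {start / 1000:.3f}\n")
            out.write(f"\\ELANEnd {end / 1000:.3f}\n\n")

# Example: eaf_to_toolbox("session1.eaf", "session1.txt")
```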

Theme 3: Data Enrichment (theme leader: Caroline Jones)

• Recording and transcribing/annotating recordings (forced alignment with HTK-based tools, e.g. MAUS; a sketch of calling the WebMAUS service follows this list).

• Eventual automatic transcription

• Distributed annotation (including crowdsourcing):

online systems for annotating page images of notes (archival manuscripts: handwriting recognition)

annotating dynamic media

interlinearising annotations

• What emerging tools or methods can we look to and invest in?

• Increasing scope of recordings (e.g., Aikuma)

• Delivery of language records for speakers (phone apps, HTML5 services from archives)

• Dictionary creation and presentation systems (online, app-based)
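
As an illustration of the forced-alignment item above, here is a minimal sketch of calling the WebMAUS service over HTTP. The endpoint, parameter names and language code are assumptions based on the BAS WebServices documentation, not a tested recipe; check the current service description before relying on them.

```python
# A sketch only: submit a wav file and orthographic transcript to the WebMAUS
# (runMAUSBasic) service for forced alignment. Endpoint, parameter names and
# language code are assumptions based on the public BAS WebServices
# documentation; verify against the current service description.
import requests

MAUS_URL = ("https://clarin.phonetik.uni-muenchen.de/"
            "BASWebServices/services/runMAUSBasic")

def run_maus(wav_path, txt_path, language="eng-AU", outformat="TextGrid"):
    """Return the server's XML response, which points to the aligned output."""
    with open(wav_path, "rb") as wav, open(txt_path, "rb") as txt:
        files = {"SIGNAL": wav, "TEXT": txt}
        data = {"LANGUAGE": language, "OUTFORMAT": outformat}
        resp = requests.post(MAUS_URL, files=files, data=data)
    resp.raise_for_status()
    return resp.text  # XML containing a success flag and a download link

# Example: print(run_maus("recording.wav", "transcript.txt"))
```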

Theme 4: Corpora, Scale (theme leader: Steven Bird)

• Corpus development for small languages: what standards should we adopt or develop for small-language corpora, given that they may differ from those in use for large languages?

• What frameworks exist into which small textual/media corpora can be placed for general use (e.g., developing EOPAS.org.au)?

• Interfaces, models and technologies for mobile language apps (scaling up recording and delivery)

See the Program

Please note that registrations are closed and this conference is now full!

June 1-3, 2016, University of Melbourne

This conference will bring together people interested in building better tools and methods to document small languages. While the conference is primarily aimed at tool developers and at the linguists using those tools, a major outcome of our work will be increased access for speakers to better records of more languages than is currently the case.

This conference will help to set the agenda for collaboration on standards and tool development and provide the CoEDL with direction for investment of funds. Attendance will be by invitation and will target those actively working in the area.

Similar workshops and conferences were run by emeld.org in the USA a decade ago, and a Digital Tools Summit in the Humanities was held at Virginia in 2005.

We will develop some discussion points before the conference that will be distributed; participants will be appointed to lead discussion in working groups. Each developer will present their software and outline their development plans, noting what kinds of problems they have encountered and what users have said about the software. There will be a process for scribing results and developing a report back from working groups.

We want to explore innovative tools and methods and identify current problems for fieldwork, recording, transcription, analysis, archiving and accessibility of language material.

Background reading

Emily M. Bender & Jeff Good. 2010. A Grand Challenge for Linguistics: Scaling Up and Integrating Models

A set of useful links provided by participants on the registration page:

Tools and services

Global Open Resources and Information for Language and Linguistic Analysis (GORILLA) is a project bringing together an interdisciplinary community of linguists, anthropologists, and computer scientists to collaborate on creating the tools for automatic or semi-automatic transcription and analysis of audio and visual information, focusing on low-resourced languages of the world: http://gorilla.linguistlist.org/

Lexicon Enhancement via the GOLD Ontology (LEGO) is a project to establish tools and standards to facilitate the sharing and interoperation of lexical data: http://lego.linguistlist.org/

MultiTree is a searchable database of hypotheses on language relationships: http://www.multitree.org/

The purpose of the GOLD Community is to bring together scholars interested in best-practice encoding of linguistic data, to promote best practice as suggested by E-MELD, to encourage data interoperability through the use of the GOLD Standard, to facilitate search across disparate data sets, and to provide a platform for sharing existing data and tools from related research projects: http://linguistics-ontology.org

CMDI Maker is a browser-based tool for generating CMDI and IMDI metadata: http://cmdi-maker.uni-koeln.de

The Language Archive Core and Language Archive Core (Repository) profiles aim to provide basic metadata profiles for language repositories: https://github.com/fxru/LangArchCore-Metadata

Poio is open source technology for language diversity: http://www.poio.eu

Datavyu is a complete software package for visualizing and coding behavioral observations from video data sources. http://datavyu.org/user-guide/index.html

MAUS (Munich AUtomatic Segmentation) web services: https://clarin.phonetik.uni-muenchen.de/BASWebServices/#/services/WebMAUSGeneral

Databrary is a video data library for developmental science: https://nyu.databrary.org/

LaBB-CAT is a browser-based linguistics research tool that stores audio or video recordings, text transcripts, and other annotations. Annotations of various types can be automatically generated or manually added. https://labbcat.canterbury.ac.nz/system/

Computer Tools for Field Linguistics and Language Documentation [Russian only]: http://languedoc.philol.msu.ru:8082/fieldling/

Converter from the old Transcriber format to the Toolbox (“standard”) format (text with markers), creating a file that can later easily be imported into ELAN [site currently down/being redeveloped?]: http://linguisticsoftwareconverters.zong.mine.nu/

Documentation index: a service to present OLAC records and visualise the amount of material available per language (draft version): http://mlr-au.github.io/pdsc-olac-visualisation/app/#/

Language (group) specific projects

Croatian Language Repository: http://riznica.ihjj.hr

CorpAfroAs is an integrated pilot project realized by field linguists for field linguists and typologists for the documentation of Afroasiatic languages: http://corpafroas.huma-num.fr/

Description, typology and documentation of the languages of Senegal [French only]: http://senelangues.huma-num.fr/

SEAlang Library Ahom Dictionary Resources, based on Ahom texts transcribed, transliterated, and translated by the Ahom Dictionary Resource Project: http://sealang.net/ahom

This tool searches data collected by Stephen Morey and associates in Assam from 1996 to the present: http://sealang.net/assam

ZongList is a web based dictionary framework for Tima, a language of the Nuba Mountains, Sudan: http://tima-dictionary.mine.nu/

Within the DOBES programme, documentation projects are documenting highly endangered languages, including Saliba and Logea (PNG): http://dobes.mpi.nl/projects/saliba/

Iltyem-iltyem is an online resource for sign languages used in Indigenous communities in Central Australia: http://iltyemiltyem.com/sign/

This portal provides information on the linguistic diversity of the Brazilian Amazon and its documentation, and on relevant activities of the Linguistics Department of the Emilio Goeldi Museum [Portuguese only]: www.museu-goeldi.br/linguistica

'Landing page' guide for a language: http://bit.ly/SouthEfate

Cross-linguistic projects

Indigenous Peoples Committee of Taiwan: http://web.klokah.tw/text/read.php?tid=308

(click the 中 symbol for Chinese and then right-click to translate the page to English) http://web.klokah.tw/video/

This web site contains supporting electronic material for the Atlas of Pidgin and Creole Language Structures (APiCS): http://apics-online.info/

The aim of the CorTypo project is the elaboration of an innovative system of linguistic annotation for natural-language corpora in lesser-described spoken languages, with a view to testing linguistic hypotheses on spontaneous discourse data from a typological perspective: http://cortypo.huma-num.fr/

[authentication required to see site] http://3pant.mine.nu/

This project investigates the encoding of events involving three participants in human language: http://dobes.mpi.nl/research-projects/cross-linguistic-patterns-in-the-encoding-of-three-participant-events/

ODIN stands for the Online Database of Interlinear Text, a collection of interlinear glossed text (IGT) instances extracted from linguistic documents on the Web: http://faculty.washington.edu/fxia/odin/

Language CoLLAGE (Collection of Language Lore Amassed through Grammar Engineering) is a collection of grammatical descriptions of 50 languages (and counting) developed on the basis of the LinGO Grammar Matrix in the context of Linguistics 567 at the University of Washington: http://www.delph-in.net/matrix/language-collage/

Automatic Generation of Grammars for Endangered Languages from Glosses and Typological Information: http://depts.washington.edu/uwcl/aggregation/

Cross-Linguistic Linked Data - helping collect the world's language diversity heritage: http://clld.org

Multi-CAST (Multilingual Corpus of Annotated Spoken Texts) is a collection of non-elicited, spoken texts from different languages, most of them monologic narratives: https://lac.uni-koeln.de/en/multicast/

Corpora for various languages [site mostly in Russian]: http://web-corpora.net/

JavaScript visualiser from PARADISEC that handles .eaf, .trs and .flextext files: https://github.com/MLR-au/pdsc-collection-viewer/