The many research communities that rely upon Language Resources (LR) have benefitted from massive contributions from data centers, government agencies and research groups around the world. Nevertheless, research potential remains largely untapped because the LRs that fuel development fall far short of need as measured by volume, data type, and language coverage. Searches for data sets regularly go unfulfilled even for the dozen languages with the greatest populations and gross linguistic products.
Notwithstanding advances in data collection and processing, the supply of LRs continues to lag behind need in part because of the limited incentive models employed. Throughout the history of LR development, the most common incentives offered to people in exchange for their contributions of raw language data and judgements have been monetary. Perhaps this tendency is based on convenience or perhaps it reflects a belief concerning the ethics of data contribution. In any case, that bias has limited the LR user communities’ ability to collect data, for example in the absence of ready funding, in situations where funding cannot easily be transferred, and from groups, such as indigenous communities, with other motivations. The focus on monetary incentives has also limited opportunities to understand how other incentives might attract different workforces, what kinds of workflows might be optimal for such workforces and how their contributions could be integrated into research and technology development efforts.
Social media, in contrast, has employed a wider range of incentives, including: access to information and entertainment; possibilities for self-expression, sharing and publicizing intellectual or creative work; chances to vent frustrations or convey thoughts, sometimes anonymously; forums for socializing; situations in which to develop competence that may lead to new prospects; competition, status, prestige, and recognition; payment or discounts in real and virtual worlds; access to services and infrastructure based on contributions; opportunities to contribute to a greater cause or good.
Within HLT communities there have been a few projects that employ these incentives. SPICE provided contributors with access to a speech recognition system that was built from their own contributions. Let’s Go improved access to public transit. Herme offered the unusual experience of interacting with a tiny, cute robot. Crowd Curio offered experiential learning of e.g. historical linguistic behaviors. “On Everyone’s Mind and Lips” mapped the linguistic landscape of Austria. LanguageARC offers citizen linguists opportunities to contribute to research on timely issues such as bias in public discourse, documenting under-resourced languages and building normative models that can be used in the study of neuro-divergence and neurodegenerative disease.
However, outside our fields, and sometimes outside our reach, are efforts that employ varied incentives to much greater effect, creating massive LRs. LibriVox offers contributors the chance to create audio recordings of classic works of literature, develop their skills as readers and voice actors, work within a community of similarly minded volunteers and enable access for the blind, the illiterate and others for whom existing versions were inaccessible. On the other hand, researchers cannot always rely on contributions from social media providers whose products are not always well matched to our research questions or who may be unable or unwilling to share their holdings in the ways that our research programs need.
Given the perpetual need for larger and more diverse LRs, the success of novel incentives in other fields that collect data from human contributors and the early successes and growth of interest among LR creators, this workshop will continue the discussion from the 2016 LREC Workshop on Novel Incentives in Data Collection and the 2020 LREC Workshop on Citizen Linguistics and Language Resource Development via a half day of papers and posters with optional demos. The workshop will also provide travel subsidies for the best student paper and the best paper that uses the LanguageARC citizen linguistics platform.
Goals
The goal of this workshop is to encourage and provide a venue for research on novel incentives to supplement LR collection based traditionally on monetary compensation. By increasing the range of incentives offered we can increase the diversity of LRs available by reaching speaker groups that have been previously inaccessible and by enabling work on languages and topics that are not currently among any funder’s priorities.
Because linguistic innovation is effectively limitless, relying upon a limited resource, monetary compensation, to generate the data needed to document the world’s languages in all their situations of use is certain to fall short. Instead, the community of language resource developers and users must develop and employ incentives that scale beyond the budget of 3- or 5-year programs. While a few innovative efforts employ novel incentives, such incentives remain uncommon in our field even as their use grows among social media providers.
Topics
In order to continue and expand the discussion on novel incentives this workshop will invite contributions on related topics including:
· projects that use alternate or novel incentives
· characteristics and performance of populations attracted by novel incentives
· modifications of the data collection and annotation tasking and workflows to accommodate new workforces, including the now familiar crowdsourcing approaches
· techniques for integrating the results of novel incentives, workforces and workflows into research
· legal and ethical issues related to novel incentive models
· other topics relevant to novel incentives in data collection from people
The workshop will also consider papers that discuss data collection efforts employing monetary compensation provided they compare to alternate incentives or address the issues of tasking, workflow or exploiting the results of the new workforce.
Presenting authors of Best Paper Employing LanguageARC and Best Student Paper will receive travel assistance to present during this workshop at LREC.
Submissions
We will accept papers between 4 and 8 pages excluding references. Accepted workshop papers will be published as workshop proceedings along with the main conference papers. Papers must follow the LREC 2022 style sheet and author’s kit templates. Papers are to be submitted via the workshop START page.
Important Dates
- submission deadline: April 15, 2022 (extended from April 8, 2022)
- notification of acceptance: April 28, 2022
- deadline for camera-ready versions: May 23, 2022
Identify, Describe and Share your LRs!
Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.
As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2022 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.