Analyzing Child Language Experiences around the World

A Digging Into Data Project

logo credit: Francesca Casillas

How universal are children’s language experiences across cultures and language communities?

Both the amount and the quality of the language children are exposed to influence children's language development. Certain language environments are associated with earlier language acquisition, greater proficiency, and better literacy outcomes. However, children successfully develop language in widely varying cultural and linguistic environments. It is therefore important to gain a better understanding of these differences (and similarities) in children's language experiences around the world.

Characterizing the variability in children’s language experience is very challenging using existing methods, making it difficult to test which characteristics of that experience are most important for children's language development. This project brought together diverse, naturalistic datasets and built state-of-the-art language processing tools to measure the range and types of variability in children’s language experiences and relate this to variability in their language development.

Studies of child language have until recently focused on small samples of children whose interactions were laboriously transcribed in order to be analyzed. However, the recent advent of small, wire-free recorders (e.g. LENA) has allowed researchers to easily gather unscripted, everyday interactions between caregivers and children over the course of an entire day. New advances in speech processing software hold the key to a more automated approach to analyzing these thousands of hours of audio recording, something that would be impractical to attempt by hand. A key goal of this project is to make these diverse, large-scale collections of audio recordings of child language environments accessible, comparable, and analyzable through the development of a shared annotation system and new tools for the automated analysis of noisy, real-world language recordings.

The ACLEW project seeks to build a cross-culturally valid description of children's real-world early language experiences with a highly diverse sample, a common annotation system, and new open-source tools for automated analysis.

The project requires complex coordination across a large group of researchers and laboratories, as illustrated here.

Creating a common annotation system for the datasets requires harmonizing decisions across cultural and language contexts.

Building tools to automatically analyze those datasets requires an iterative process in which smaller datasets are used to develop tools, and these tools in turn allow faster annotation as the datasets are expanded.