Demographic Labeling in Big Data Research
Approaches and advocacy around demographic labeling and curation
Written by the CHIRON Project Team
Published on May 7, 2024
Choices in data curation and analysis are far from neutral or objective. Decisions about how demographic groups are lumped, split, or excluded entirely can meaningfully affect how researchers, policy makers, clinicians, and patients are able to make use of health research.
Multidisciplinary researchers have recognized the urgency of standardization in this space. The PhenX Toolkit is a multifaceted set of validated protocols for quantitative research on human subjects, informed by scholars across disciplines (including CHIRON academic workgroup member Maile Taualii). One area receiving particular attention from both researchers and community members is demographic categorization, often pertaining to race, ethnicity, sex, and gender.
The National Academies of Sciences, Engineering, and Medicine (NASEM) released a 2023 report on the use of population descriptors in genetics and genomics research, an essential read for researchers in this field. Their brief set of recommendations provides guidance that may be relevant more broadly, instructing researchers to avoid the term Caucasian, which is rooted in white supremacy,1,2 and advising that researchers “disclose the process by which they selected and assigned group labels and the rationale for any grouping of samples.”
Innovations in the application of population descriptors often come from communities persistently asserting that commonly used categorization schemas do not meet their needs. Many Pacific Islanders, for instance, reject the grouping Asian American Pacific Islander (AAPI), describing how it can invisibilize their data (as a proportionally smaller group in many US cities) and, in turn, limit resource allocation toward Pacific Islander people specifically. Critiques are also leveled at the grouping Asian American itself for failing to capture the diversity of the people it includes. Since 1997, Asian and Native Hawaiian and other Pacific Islander have been disaggregated from each other on the US Census, and respondents are able to self-identify into subgroups.
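For secondary researchers, the effect of lumping can be made concrete with a small sketch. The Python example below is illustrative only: the response labels, the counts, and the grouping map are all hypothetical and are not drawn from any dataset discussed here. It tallies the same responses at three levels of aggregation and keeps the grouping map explicit, in the spirit of disclosing how group labels were assigned.

```python
from collections import Counter

# Hypothetical self-reported labels; counts are invented purely for illustration.
responses = (
    ["Chinese"] * 40
    + ["Filipino"] * 30
    + ["Native Hawaiian"] * 3
    + ["Samoan"] * 2
)

# Hypothetical grouping map. Writing it down explicitly documents
# how each subgroup label was assigned to a broader group.
SUBGROUP_TO_GROUP = {
    "Chinese": "Asian",
    "Filipino": "Asian",
    "Native Hawaiian": "Native Hawaiian or Other Pacific Islander",
    "Samoan": "Native Hawaiian or Other Pacific Islander",
}


def count_by(labels, mapping=None):
    """Count labels, optionally mapping each one to a broader group first."""
    if mapping is None:
        return Counter(labels)
    return Counter(mapping[label] for label in labels)


subgroup_counts = count_by(responses)                   # fully disaggregated
group_counts = count_by(responses, SUBGROUP_TO_GROUP)   # two separate groups
lumped_counts = count_by(                               # single combined bucket
    responses, {label: "AAPI" for label in SUBGROUP_TO_GROUP}
)

print(subgroup_counts)
print(group_counts)
print(lumped_counts)
```

In the single combined bucket, the five hypothetical Pacific Islander responses can no longer be distinguished from the other seventy, which is the kind of invisibilization described above.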
The US Census is a persistent site of discourse on race and ethnicity categories. After extensive advocacy, the Office of Management and Budget (OMB) is adding Middle Eastern or North African as a new category distinct from white, including on the next Census. Race and ethnicity will also be asked as a single multi-selection question on the next Census, a change that was well received by some Afro-Latino respondents in OMB’s preliminary work but has been criticized in other venues for its potential impact on data from Afro-Latinos.
Indeed, new approaches to standardization do not always work for everyone. In addition to their guidance on population descriptors in genetics research, NASEM released updated guidance in 2022 on asking about sex and gender in surveys. The committee recommends a two-step approach, inquiring first about sex assigned at birth and then about current gender, with the options female / male / transgender / Two-Spirit (if respondent is American Indian or Alaska Native) / I use a different term: [free text]. While these questions, when combined, are meant to assist researchers in obtaining specific data on transgender participants, critical appraisal of this approach notes that it forces, for instance, trans women to choose between transgender and female, and the lack of cisgender language reinforces cisgender experience as a status quo. Critics note that this forced choice is a function of categorizing transgender as a gender rather than a gender modality. The Sexual and Gender Minority Interest Group at the National Cancer Institute (NCI) published a comprehensive set of revisions to the NASEM recommendations for measuring sex and gender, using their own expertise and approaches used in the PRIDE Study. While the recommendations of both NASEM and the NCI researchers are tailored toward primary data collection, both approaches are meant to enable large-scale data collection on sex and gender, where collecting highly specific information can affect analytical power. As such, secondary researchers may also find this discussion enlightening, inasmuch as sex and gender categorization is part of their work.
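The forced choice that critics describe can be seen in how two-step responses might be coded. The sketch below is a hypothetical illustration, not the NASEM or NCI coding scheme: the `derive_modality` function, its cross-classification rule, and the assumption that sex assigned at birth is recorded as "female" or "male" are all introduced here for the example.

```python
from typing import NamedTuple


class TwoStepResponse(NamedTuple):
    sex_assigned_at_birth: str  # assumed options: "female" or "male"
    current_gender: str         # "female", "male", "transgender", "Two-Spirit", or free text


def derive_modality(response: TwoStepResponse) -> str:
    """Hypothetical cross-classification of the two survey items."""
    binary = {"female", "male"}
    if response.current_gender == "transgender":
        return "transgender"
    if response.current_gender in binary and response.sex_assigned_at_birth in binary:
        if response.current_gender == response.sex_assigned_at_birth:
            return "cisgender"
        return "transgender"
    # Two-Spirit and free-text answers are left unclassified in this sketch.
    return "not classified"


# Two hypothetical respondents who might both describe themselves as trans women.
selected_female = TwoStepResponse("male", "female")
selected_transgender = TwoStepResponse("male", "transgender")

for r in (selected_female, selected_transgender):
    print(r.current_gender, "->", derive_modality(r))
```

Under this assumed rule, both respondents receive the same derived modality, but the second no longer has "female" recorded as her gender. That loss is what happens when transgender is offered as a gender option rather than captured separately as a gender modality.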