Machine learning (ML) involves a feedback loop in which the model makes a prediction, that prediction is validated as right or wrong, and the validation is used to improve subsequent predictions. Validation is performed either by comparing the prediction to an existing, tagged dataset or by having people make the decision. Human-in-the-Loop (HITL) ML is the latter method, in which people manually validate the data to decrease errors as the ML model is trained (1). People validate data (text, images, audio, video, etc.) that is not yet labeled, is difficult to auto-tag, or is rapidly evolving (1).
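The feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real training pipeline: `LookupModel`, `run_round`, and the `ask_human` callback are hypothetical names introduced here, and the "model" simply memorizes validated labels so that the effect of feedback is easy to see.

```python
from typing import Callable, Dict, List


class LookupModel:
    """Toy stand-in for an ML model (hypothetical): it remembers validated
    labels and otherwise guesses a default label."""

    def __init__(self, default: str = "safe") -> None:
        self.memory: Dict[str, str] = {}
        self.default = default

    def predict(self, item: str) -> str:
        return self.memory.get(item, self.default)

    def update(self, item: str, label: str) -> None:
        # Feedback step: store the validated label for later predictions.
        self.memory[item] = label


def run_round(
    model: LookupModel,
    items: List[str],
    tagged: Dict[str, str],
    ask_human: Callable[[str, str], str],
) -> List[dict]:
    """Predict, validate, and feed the validation back into the model."""
    results = []
    for item in items:
        predicted = model.predict(item)
        if item in tagged:
            # Validation against an existing, tagged dataset.
            label, source = tagged[item], "dataset"
        else:
            # Human in the loop: a person decides the correct label.
            label, source = ask_human(item, predicted), "human"
        results.append(
            {"item": item, "predicted": predicted, "label": label,
             "source": source, "correct": predicted == label}
        )
        model.update(item, label)
    return results
```

In this sketch, an item missing from the tagged dataset is routed to a person, and a second call to `run_round` on the same items would reuse the labels validated in the first.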
HITL can be used in multiple parts of the ML process, where people:
build the model itself.
train the model to improve its predictions.
label the data the model will be trained on, which is called data labeling (1).
We focus on data labeling as it relates to content moderation and labor practices.
Data labeling includes many different tasks, such as tagging, classifying, moderating, and processing. Labeled data is defined as annotated data that shows the target prediction for the model. The labels reflect data features that can be used to find patterns and influence the model's predictions.
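As a rough illustration of what labeled data looks like, the sketch below pairs each data item with its target label. The `LabeledExample` type, the sample items, and the `safe`/`graphic` label names are assumptions made for this example and do not come from any particular labeling tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LabeledExample:
    content: str  # the raw data item (text in this sketch)
    label: str    # the target prediction the model should learn


# Hypothetical sample of labeled data.
examples = [
    LabeledExample("photo caption: sunset over a lake", "safe"),
    LabeledExample("description of a violent video clip", "graphic"),
    LabeledExample("recipe for sourdough bread", "safe"),
]


def label_counts(data: Iterable[LabeledExample]) -> Counter:
    """Count how many examples carry each label."""
    return Counter(ex.label for ex in data)
```

A count like this is one simple way to check how the labels are distributed before the data is used for training.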
Data labeling becomes a need when data is unlabeled, when existing labels are of low quality, and/or when the in-house labeling process is inefficient in performance or cost (2). To address these issues, many companies turn to digital labor platforms, which give them access to thousands of workers outside the company who can perform data labeling. This is called crowdsourced data labeling.
Digital labor platforms are web-based platforms and apps that outsource work to people all over the world, and they are a major part of the gig economy (3). The gig economy is composed of temporary and part-time workers (gig workers) who often use these digital labor platforms to find work. Many digital workers work in what is referred to as the Global South, regions that have not held historical dominance and that account for 80% of the world's population (4). Although gig workers have flexibility and independence in this industry, there is almost no job security, and the industry as a whole is largely unregulated and rapidly evolving (3). Examples of crowdsourcing platforms include Amazon Mechanical Turk, UpWork, and CrowdSpring.
Data labeling is one task that gig workers engage in.
Online users are continuously at risk of being exposed to graphic (violent, sexually explicit, illegal) content when using the Internet. Content moderation is the filtering of data to remove graphic content from a dataset or platform (5). Previously, content moderation was performed wholly by people; AI content moderation is an attempt to prevent this exposure (5). However, people are still involved in the development of AI content moderation tools, because data labeling is used in this process.