Learning Over Dirty Data

It is widely known within the data science community that data cleaning is one of the most time-consuming and difficult parts of the machine learning pipeline. Yet without cleaning, the resulting model is often nearly worthless. Since most real-world data is dirty, it is an important research problem to automate the cleaning process, or better still, to make it unnecessary.

What do I aim to do?

The end goal of my line of research is to eliminate, not merely automate, the data cleaning process. Our work aims to learn directly over the dirty data without cleaning it first, while still closely approximating the ground-truth model (the model we would obtain by training on completely clean data).

What does dirty data look like?

Dirty data can take a variety of forms. The easiest to understand are outliers and null (missing) values.
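As a concrete illustration, here is a tiny made-up dataset (the columns and values are hypothetical) containing both kinds of dirt: a missing entry in each column and an implausible outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: np.nan marks null values; 9999 is an outlier.
df = pd.DataFrame({
    "age":    [34, 29, np.nan, 41, 9999],
    "income": [52000, 48000, 61000, np.nan, 55000],
})

print(df.isna().sum())       # count of nulls per column
print(df["age"].describe())  # the outlier inflates the mean and std
```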

Computational problems

The main problem with repairing data, especially in the case of null values, is that a dirty value's clean form could be almost anything. This creates an effectively infinite repair space that must be narrowed with assumptions. Even with assumptions, the repair space can be enormous for a large dataset with many dirty cells, as the back-of-the-envelope calculation below shows.
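To see how quickly this blows up, assume (a simplifying assumption, not part of the original argument) that each dirty cell has k candidate repairs chosen independently. Then a dataset with m dirty cells admits k^m distinct full repairs:

```python
def repair_space_size(num_dirty_cells: int, candidates_per_cell: int) -> int:
    """Number of distinct full-dataset repairs under independent
    per-cell candidates: k ** m."""
    return candidates_per_cell ** num_dirty_cells

# Even modest numbers explode: 100 dirty cells with 10 candidates each
# yield 10**100 possible repairs, far beyond anything enumerable.
print(repair_space_size(100, 10))
```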

Our Approach

There is a direct trade-off between compute cost and coverage (the share of the repair space considered): the more repairs you aim to cover, the more computation is required. Our approach modifies existing computations within machine learning models to improve coverage while limiting computational expense. This is done by exploiting information common to many repairs, as well as creative and novel semantics for learning.
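To give one flavor of what "exploiting common information" can mean, here is a minimal sketch (my own illustration, not the actual method described above): rows that contain no dirty cells contribute the same loss to every repair, so that portion of the loss can be computed once and reused, and only the dirty rows need re-evaluation per candidate repair.

```python
import numpy as np

def sq_loss(w, X, y):
    """Summed squared error of a linear model with weights w."""
    return float(np.sum((X @ w - y) ** 2))

# Hypothetical split of the data: clean rows vs. rows with dirty cells.
w = np.array([1.0, 2.0])
X_clean = np.array([[1.0, 0.0], [0.0, 1.0]])
y_clean = np.array([1.0, 2.0])
dirty_repairs = [                        # two candidate repairs of one dirty row
    (np.array([[1.0, 1.0]]), np.array([3.0])),
    (np.array([[1.0, 2.0]]), np.array([3.0])),
]

shared = sq_loss(w, X_clean, y_clean)    # computed once, shared by all repairs
losses = [shared + sq_loss(Xd, yd := yd) if False else shared + sq_loss(w, Xd, yd)
          for Xd, yd in dirty_repairs]
print(losses)                            # total loss under each candidate repair
```

The saving grows with the ratio of clean to dirty rows: for a dataset that is mostly clean, almost all of the work is done once rather than once per repair.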