Bad Data Interest Group
Many statistical methods, and essentially all machine learning techniques, are limited to situations in which precise data are abundant and obey particular assumptions. Some statistical techniques have been developed for situations in which some of those assumptions can be partially relaxed, though they often still rely on assumptions that may be untenable in practice. Likewise, some statistical methods allow for small sample sizes. However, not all uncertainty arises from small samples. Poor or variable precision, missing values, non-numerical information, unclear relevance, dubious provenance, and contamination by outliers, errors, and lies are just a few of the causes of bad data.
Data can be bad if they are imprecise, available only in small sample sizes, or otherwise fail to conform to the assumptions required by the analysis you hope to apply. Although it is fashionable lately to talk about big data and how it will transform engineering and our society, we should understand that big data may turn out to be bad data if it measures the wrong thing, is imprecise or non-numerical, or has an imbalanced design. The available analytical methods may not be up to the task if they cannot take proper account of bad data. Some basic questions about bad data seem not to have clear answers:
• When investing in empirical effort, should we prefer to get more data or better data?
• Is it reasonable to combine good data with bad data?
• What can we do if it is clear that our data are not collected randomly?
• What can be done with ludicrously small samples like n=8, or n=2, or even n=1?
• If data aren’t missing “at random”, can we still draw any conclusions?
• Is it okay to ignore, as statisticians so often do, the reported precision statements associated with measurements?
• How should we characterize the uncertainty of naked numbers that have no accompanying details about their uncertainty?
• How can we make use of values established by so-called expert elicitation (i.e., asking one or more domain authorities)?
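As a small illustration of the precision question above, one simple alternative to ignoring a reported precision statement is to carry it along as an interval and propagate that interval through subsequent calculations. The sketch below is only an illustration of this idea, not a method endorsed or developed by the group; all names in it are hypothetical.

```python
# Illustrative sketch: keep a measurement's reported precision as an
# interval instead of discarding it, and propagate the interval through
# a simple computation.

from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Sum of intervals: endpoints add.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Intervals may span zero, so take the min/max of all endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


def from_reported(value, half_width):
    """Turn a reported measurement 'value ± half_width' into an interval."""
    return Interval(value - half_width, value + half_width)


# Two measured side lengths, with their reported precision kept rather than ignored:
a = from_reported(12.3, 0.05)
b = from_reported(4.1, 0.05)
area = a * b  # the area is known only to within this interval
```

Even this naive propagation makes visible how the reported imprecision of the inputs limits what can be claimed about the result, which is precisely the information lost when precision statements are ignored.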