Making the effect of assumptions about data clear in analyses
Physical reconstructions in archaeology today usually reveal what material or parts are authentic (whether actual remains or models cast from actual remains) and what interstitial material or parts are interpolated or assumed by the archaeologist. This can be done, for instance, by tinting the plaster or concrete used in the reconstruction. In the past, archaeologists often obscured this distinction in reconstructions, but modern attitudes and conventions insist on honest portrayal that distinguishes what is known from what is assumed.
This project seeks to extend this sentiment of honesty to statistical analysis and machine learning techniques that are applied to bad data. Bad data sets are those that contain non-negligible imprecision, censoring, inconsistent or missing data, fragmentary evidence, gaps, or elements of dubious relevance. Various strategies for interpolation and extrapolation and methods of data imputation are commonly used to fill in gaps and missing data, but the results depend on the assumptions implicit in the imputation methods employed. Likewise, many statistical methods for handling data censoring and missing data have been used, but they also depend on fairly strong assumptions about the process that generated the data. Although these assumptions are strong in that they can substantially affect quantitative results, they are often subtle and may seem innocuous when they are actually not. This project will develop methods that make transparently clear the effect of swapping various underpinning assumptions in and out, revealing how broad the range of possible results really is.
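To make concrete what swapping assumptions in and out can look like, here is a minimal sketch in Python using invented numbers. It estimates a sample mean under two different treatments of a single missing value: conventional mean imputation, and the much weaker assumption that the missing value lies somewhere between the smallest and largest observed values. The first treatment yields a single number; the second yields a range that honestly reflects what was not observed. The data and the choice of bounds are purely hypothetical.

```python
import numpy as np

# Invented data set with one missing value (np.nan); purely illustrative.
obs = np.array([2.1, 3.4, np.nan, 5.0, 4.2])

# Assumption 1: impute the missing value with the mean of the observed values.
mean_imputed = np.where(np.isnan(obs), np.nanmean(obs), obs)

# Assumption 2: assume only that the missing value lies somewhere between the
# smallest and largest observed values, and carry that interval through.
lo = np.where(np.isnan(obs), np.nanmin(obs), obs)
hi = np.where(np.isnan(obs), np.nanmax(obs), obs)

print("mean under mean imputation:", mean_imputed.mean())
print("mean under the weaker assumption:", (lo.mean(), hi.mean()))
```

The point is not that one treatment is right and the other wrong, but that the gap between the two answers is itself informative: it shows how much of the conclusion is carried by the assumption rather than by the data.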
Bad data is very common in many fields of science and engineering, and it is perhaps even more common in the humanities and the social sciences. Proper accounting of such bad data can be critical in archaeology, forensics, ecology, public health research, civil engineering, and social science research on issues from petty crime to human trafficking: analyses that ignore the limitations of the data are likely to yield gravely misleading conclusions.
What will limit the information age? Many presume that what generally limits scientific understanding is the limited availability of data. While this is doubtless often true, there are domains where data, or at least information, is plentiful, but our understanding is still highly constrained by our capacity to integrate the available information into concise representations and our ability to tease from it inferences that are relevant in making predictions and drawing conclusions about the world. There has lately been a lot of attention, and indeed hype, about the big data that is or will soon be flooding in from the Internet of Things, and about how machine learning techniques are going to reveal patterns and the underlying processes and mechanisms that generate them. But it is a strange idea, perhaps really magical thinking, to suppose that simply wiring up to the internet everything from household appliances to traffic cameras to dam spillway controllers will somehow spontaneously erupt into a golden information age. We expect that a critical missing ingredient, without which such a golden age cannot arrive, is a battery of analytical tools capable of handling imprecise data.
PLURAL OR SINGULAR
Traditional education strongly argues that the word 'data' is plural, so we should be saying 'data are...'; the singular form, from its Latin root, is 'datum'. However, modern English usage, as well as technical language across computer science, treats 'data' as a singular mass noun like 'gravel' or a collective noun like 'team'. We bow to this ahistorical trend, and we will write 'data is...'. But in recognition of our discomfort about this, we adopt the official joke of this webpage:
Question: What is the singular of data?
Answer: Anecdote.
There are many ways data can be bad. The oval below mentions several of them, but this tidy depiction is essentially a lie. Bad data can be bad in myriad and combined ways, and the complication that results for scientists is anything but simple. Quantitative approaches to handle many of these problems have long been a part of traditional statistical science, but they have often languished, unused in practical assessments, either because they require considerable sophistication on the part of the analyst or because they rest on untenable or uncheckable assumptions.
Sometimes data is both big and bad. For instance, cheap sensors can be widely deployed to collect data sets with very large sample sizes. The cheaper the sensors, the more of them can be deployed, but the cheaper the sensor, the less precision it can provide. A good example is the use of smart phones to collect seismological data. Smart phones are equipped with simple accelerometers, used to discern how users are holding the phone, and these accelerometers can serve as crude seismographs. Although each is much less precise than a true seismograph, cell phones are vastly more numerous and are often broadly distributed across locations of high interest. We could use them to study earthquakes, but it would be impossible to ignore their imprecision in such studies.
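One simple way to keep such imprecision visible, sketched below in Python, is to carry each crude reading forward as an interval whose width reflects the sensor's stated precision. The readings and the ±0.05 g precision figure here are invented for illustration, not properties of any actual phone.

```python
import numpy as np

# Hypothetical accelerometer readings (in g) from cheap phone sensors, and an
# assumed measurement precision of +/-0.05 g; both are invented for this sketch.
readings = np.array([0.12, 0.31, 0.08, 0.47, 0.22])
precision = 0.05

# Represent each reading as an interval rather than a single number.
lo = readings - precision
hi = readings + precision

# Even a simple summary such as the mean then becomes an interval, so the
# sensors' imprecision is carried through rather than silently discarded.
print("mean acceleration lies in [%.3f, %.3f] g" % (lo.mean(), hi.mean()))
```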
Accounting for imprecision can be computationally difficult. The calculation of even simple statistics such as variance from imprecise data is known to be an NP-hard problem, which implies that computational complexity can grow quickly as a function of the data sample size. This might be quite worrying because it could mean that handling imprecise data would be prohibitively expensive or even practically impossible when data sets are large. It turns out, however, that there are various special cases, including most cases commonly of interest, for which the computational effort is actually a linear function of sample size. In practice, this renders the computations necessary for imprecise data sets only slightly more costly than for point data values. The graph below depicts the time (in milliseconds) to compute various statistics from data sets of various sizes composed of intervals. The calculation speeds are essentially the same as what we expect for ordinary data sets composed of point values.
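The contrast between the easy and the hard cases can be seen in a small Python sketch with invented interval data. Bounds on the sample mean fall out in linear time by using all lower endpoints and then all upper endpoints, whereas the exact upper bound on the sample variance is computed here by brute-force enumeration of the 2^n endpoint configurations, an exponential approach that hints at why the general problem is NP-hard; the linear-time algorithms actually used for the common special cases are not shown.

```python
import itertools
import numpy as np

def interval_mean(lo, hi):
    """Bounds on the sample mean of interval data: exact, and linear in n."""
    return np.mean(lo), np.mean(hi)

def max_variance_bruteforce(lo, hi):
    """Exact upper bound on the sample variance of interval data.

    Because variance is a convex function of the data vector, its maximum over
    the box [lo_1, hi_1] x ... x [lo_n, hi_n] is attained at a vertex, so
    enumerating the 2**n endpoint configurations gives the exact answer --
    at exponential cost, which is why the general problem is NP-hard.
    """
    best = -np.inf
    for choice in itertools.product(*zip(lo, hi)):
        best = max(best, np.var(choice))
    return best

# Invented interval data set: each value is known only to within an interval.
lo = np.array([1.2, 2.9, 3.1, 4.8, 5.0])
hi = np.array([1.8, 3.3, 3.9, 5.2, 6.0])

print("bounds on the mean:", interval_mean(lo, hi))
print("upper bound on the variance:", max_variance_bruteforce(lo, hi))
```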