Multivariate Spatio-Temporal Data

Figure 1: Quarterly Workforce Indicators (QWI) for (a) women in the education industry in the second quarter of 2006, and (b) men in the agricultural industry in the first quarter of 2001.

Many data sources provide information over different spatial regions and time points, which gives context for understanding the data. For example, cancer mortality rates tend to be higher in southeastern counties of the US. Similarly, cancer mortality rates have been increasing regularly each year (see Bradley et al., 2014). However, data sources rarely provide "spatio-temporal data" on just a single variable; for example, federal datasets are tabulated by age, gender, and race. To verify this for yourself, go to the data sources in the “E-Resources” link. Moreover, to preserve the privacy of individual respondents, a majority of these public-use federal datasets consist of spatial data that are referenced over geographic regions (i.e., areal data) rather than point-referenced data. Thus, a growing number of multivariate spatio-temporal areal datasets are becoming available. Despite the prevalence of such data, there are very few methods available to analyze them (see the review in Bradley et al. (2014)).

Many issues plague multivariate spatio-temporal areal datasets, but an immediate difficulty is that the datasets are often very large. Thus, in Bradley et al. (2014), we consider a modeling approach for high-dimensional multivariate spatio-temporal areal data. As an example, consider the Quarterly Workforce Indicators (QWI) published by the Longitudinal Employer-Household Dynamics (LEHD) program. In Figure 1, we display estimates of the average monthly income of individuals who have a steady job. The QWIs in Figure 1 are presented for two spatial fields. However, the entire dataset contains many more QWIs, covering each quarter in the years 1990 to 2013 (a total of 92 discrete time points), each of the 3,145 US counties, both genders, and 20 industries. The entire dataset consists of 7,530,037 QWIs; data of this dimension pose considerable methodological challenges.
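To get a feel for these dimensions, the short sketch below works out how many quarter/county/gender/industry cells a complete cross-tabulation would contain and compares that with the 7,530,037 QWIs quoted above. It assumes the full dataset would be the complete cross of the four dimensions, which is an illustrative assumption rather than a statement about how LEHD organizes its files; the variable names are likewise hypothetical.

```python
# Dimensions quoted in the text above.
n_quarters = 92
n_counties = 3145
n_genders = 2
n_industries = 20
n_observed = 7_530_037  # QWIs reported as available

# Assumption: a complete dataset would have one QWI per
# quarter x county x gender x industry combination.
n_cells = n_quarters * n_counties * n_genders * n_industries

# Cells with no reported QWI under that assumption.
n_missing = n_cells - n_observed
share_observed = n_observed / n_cells

print(f"full grid: {n_cells:,} cells")
print(f"observed:  {n_observed:,} ({share_observed:.1%})")
print(f"missing:   {n_missing:,}")
```

Under this assumption, a sizable share of cells has no reported value, which is consistent with the note in the Figure 2 caption that LEHD does not provide estimates at every county in every quarter.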

Figure 2: In (a) and (b), we present LEHD estimated average monthly income (US dollars) for the state of Missouri, for each gender, for the education industry, and for the first quarter of 2013. LEHD does not provide estimates at every county in the US at every quarter; counties without an estimate are shaded white. In (c)-(f), we present the corresponding maps (for the state of Missouri, for each gender, for the education industry, and for quarter 92) of predicted monthly income (US dollars), and their respective posterior square root MSPE. Notice that the color scales are different for each panel.

To address the problem of processing high-dimensional multivariate spatio-temporal areal data, we provide several methodological developments. In brief, the methodological contributions of Bradley et al. (2014) are: a new reduced rank dynamic first-order linear model, an innovative parameter model, an extension of the Moran’s I basis functions to the multivariate spatio-temporal setting, and a new class of propagator matrices for a first-order vector autoregressive (VAR(1)) model.
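The general idea behind a reduced rank first-order dynamic model can be illustrated with a small simulation. The sketch below is not the model of Bradley et al. (2014): it is a generic textbook form in which a low-dimensional state follows a VAR(1) process and is mapped to the areal units through a basis matrix. The basis matrix `S`, the propagator matrix `M`, the noise levels, and the function names are all hypothetical choices made for illustration; in particular, `S` is not a set of Moran's I basis functions and `M` does not come from the paper's class of propagator matrices.

```python
import random

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def simulate_reduced_rank_var1(S, M, n_steps, state_sd=1.0, noise_sd=0.1, seed=0):
    """Simulate y_t = S eta_t + eps_t with eta_t = M eta_{t-1} + u_t.

    S is n x r (n areal units, r << n basis functions), M is r x r.
    All inputs are illustrative placeholders, not estimated quantities.
    """
    rng = random.Random(seed)
    r = len(M)
    # Initial low-dimensional state.
    eta = [rng.gauss(0.0, state_sd) for _ in range(r)]
    fields = []
    for _ in range(n_steps):
        # First-order (VAR(1)) propagation of the r-dimensional state.
        eta = [m + rng.gauss(0.0, state_sd) for m in matvec(M, eta)]
        # Map the state to the n areal units and add observation noise.
        fields.append([s + rng.gauss(0.0, noise_sd) for s in matvec(S, eta)])
    return fields

# Hypothetical example: 6 areal units, rank-2 basis.
S = [[1.0, 0.5], [1.0, 0.2], [1.0, -0.1],
     [1.0, -0.4], [1.0, 0.3], [1.0, -0.5]]
# Hypothetical propagator with eigenvalues 0.8 and 0.5 (a stable process).
M = [[0.8, 0.0], [0.1, 0.5]]

fields = simulate_reduced_rank_var1(S, M, n_steps=4)
for t, field in enumerate(fields, start=1):
    print(t, [round(v, 2) for v in field])
```

The point of the reduced rank structure is that the temporal dynamics are carried by the r-dimensional state rather than by the full set of areal units, so the cost of propagation depends on r rather than on the number of regions.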

Using this methodology, we provide predictions of QWIs at counties where no estimate is available, along with associated measures of uncertainty. In Figure 2, we display predictions of average monthly income for men and women in the education industry for quarter 92. Note that predictions are available over the whole United States, for each quarter from 1990 to 2013, and for each gender/industry combination.

Articles:

Data: