Disclosure Limitation / Data Confidentiality

My exposure to the topics of disclosure limitation and data confidentiality began while I was a postdoctoral researcher at the University of Missouri, working alongside Prof. Scott Holan and Prof. Christopher Wikle. My work in this area has focused primarily on cases where the data are spatially referenced and contain sensitive information and/or small counts that may lead to disclosure. In these settings, an attractive alternative to releasing suppressed or "perturbed" data (e.g., data with random noise added) is to release multiply imputed synthetic data. More specifically, we begin by fitting an appropriate model to the data under a Bayesian framework -- e.g., a log-Gaussian Cox process model for individuals' addresses or a conditional autoregressive model for the number of events in each spatial region. Upon obtaining samples from the posterior distribution of the model parameters, we replace the original data with samples from the posterior predictive distribution, which can then be made publicly available.
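To make this workflow concrete, below is a minimal sketch in R of generating multiply imputed synthetic counts. A simple conjugate Poisson-gamma model stands in for the richer spatial models described above, and all data values are illustrative.

    ## Illustrative counts y and known population sizes n for five regions
    set.seed(1)
    y <- c(12, 5, 0, 8, 3)
    n <- c(4000, 1500, 800, 2600, 1200)
    a <- 1; b <- 1000   # gamma(a, b) prior on each region's event rate

    ## Generate M synthetic copies: draw rates from the posterior, then
    ## draw synthetic counts from the posterior predictive distribution
    M <- 10
    synthetic <- replicate(M, {
      lambda <- rgamma(length(y), shape = a + y, rate = b + n)
      rpois(length(y), lambda * n)
    })
    synthetic   # each column is one synthetic copy of the data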

More recently, my research in this area has focused on the development of differentially private approaches for generating synthetic data -- work that serves as the basis for my recently funded NSF CAREER award.

Differentially private data synthesis methods

Beginning with my experience working at the CDC in 2014-2016, data from CDC WONDER have provided a testbed for much of my methodological work in spatial statistics. One of the first challenges I encountered when working with data from CDC WONDER -- particularly after leaving the CDC and returning to academia -- was analyzing data in the presence of privacy protections in the form of small-count suppression. To overcome this challenge, I leaned on my past experience analyzing left-censored occupational exposure data and subsequently published a manuscript, targeted to users of CDC data, that described how spatial models can be modified to analyze left-censored data.
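As a rough illustration of the idea, the sketch below (in R, with hypothetical inputs) shows how suppressed cells can enter a Poisson likelihood through the probability of the censoring interval rather than being discarded; in the published work, a term of this form is embedded within a full spatial model. The cutoff of 10 reflects suppression of counts between 1 and 9.

    ## Log-likelihood for counts subject to small-count suppression: 'y' holds
    ## the observed counts, with NA marking suppressed cells (true count 1-9)
    loglik <- function(log_rate, y, n, cutoff = 10) {
      mu  <- exp(log_rate) * n          # expected counts
      obs <- !is.na(y)
      ll_obs  <- dpois(y[obs], mu[obs], log = TRUE)
      ## suppressed cells contribute Pr(1 <= Y <= cutoff - 1)
      ll_cens <- log(ppois(cutoff - 1, mu[!obs]) - ppois(0, mu[!obs]))
      sum(ll_obs) + sum(ll_cens)
    }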

Around the same time, I recognized that despite the availability of methods suited for the analysis of suppressed data from CDC WONDER, many users would inevitably design their studies to avoid encountering suppressed data altogether -- e.g., restricting analyses to urban areas or failing to stratify data by important factors such as age, race/ethnicity, and gender. Thus, I became motivated to fill in the gaps in CDC WONDER by developing methods for the creation of a (partially) synthetic CDC WONDER. Given my background in spatial statistics, my initial foray into this world was to simply use methods for the analysis of multivariate spatiotemporal data (Quick et al., 2017) to model the true data and then generate synthetic values from the posterior predictive distribution (see Quick and Waller (2018) below).

While that work was underway, I began to learn more about the formal privacy definition known as differential privacy. After reading the work by Machanavajjhala et al. (2008) underlying the US Census Bureau's OnTheMap tool, I was inspired to adapt their approach to the setting encountered on CDC WONDER. Specifically, this involved generalizing their framework -- in which the true data are modeled as multinomially distributed with allocation probabilities assigned a Dirichlet prior -- to one in which the true data are modeled as Poisson distributed with known population sizes and event rates assigned gamma priors. This work got underway during my time as an ASA/NCHS Research Fellow at the National Center for Health Statistics (as chronicled in Amstat News), served as the foundation of my recently funded NSF CAREER award, and was subsequently published in JRSS-A.
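A minimal sketch of the resulting synthesizer is below (in R, with illustrative values). Under a gamma(a, b) prior on a region's event rate, the synthetic count is a draw from the posterior predictive (negative binomial) distribution; Quick (2021) derives the precise conditions under which the informativeness of the prior -- governed by the shape parameter a -- yields a desired level of differential privacy, which this sketch simply takes as given.

    ## Poisson-gamma synthesizer: replace the true count y (population n)
    ## with a draw from the posterior predictive distribution under a
    ## gamma(a, b) prior on the event rate; larger shape parameters a
    ## correspond to more informative priors (and stronger privacy)
    synthesize <- function(y, n, a, b) {
      rnbinom(length(y), size = a + y, prob = (b + n) / (b + 2 * n))
    }

    ## Example: priors centered at rate a/b = 0.003 with varying informativeness
    set.seed(42)
    synthesize(y = 7, n = 2000, a = 3,   b = 1000)    # weakly informative
    synthesize(y = 7, n = 2000, a = 300, b = 100000)  # highly informative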

Driven by the promise of the Poisson-gamma framework for generating differentially private synthetic data for problems like CDC WONDER, the current objective of my work in data privacy is the continued evaluation and refinement of this framework toward the creation of a differentially private "Synthetic CDC WONDER".

  • Quick, H. (2022). “Improving the utility of Poisson-distributed, differentially private synthetic data via prior predictive truncation with an application to CDC WONDER.” Journal of Survey Statistics and Methodology, 10, 596-617. [jssam]

    • This paper improves the utility of the Poisson-gamma framework by truncating the range of the synthetic data to a "plausible" range (as defined by the prior predictive distribution). This truncation improves utility by substantially reducing the informativeness required of the gamma priors (as measured by their shape parameters) to achieve a given level of privacy; a sketch of the truncation step follows this list. That said, this approach is highly sensitive to the quality of the prior information used (e.g., "bad" prior information will restrict the synthetic data to a "bad" range of values). This should not be a problem for many applications in the context of CDC WONDER -- e.g., death rates from the same cause of death for a given demographic group are often comparable geographically -- but could pose a challenge in other settings.

  • Quick, H. (2021). “Generating Poisson-distributed differentially private synthetic data.” J. Roy. Statist. Soc., Ser. A (Statistics in Society), 184, 1093-1108. [jrss-a]

    • This paper lays the foundation for the Poisson-gamma framework for generating differentially private synthetic data. In short, differential privacy can be satisfied for a given level of ε provided the gamma priors are sufficiently "informative" (as measured by the gamma distribution's shape parameters).
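The sketch below illustrates the prior predictive truncation idea from Quick (2022) in R: the synthetic count is drawn from the posterior predictive distribution restricted to a plausible range defined by quantiles of the prior predictive distribution. The 99% bounds and all input values are illustrative, and the paper's formal privacy accounting is omitted.

    ## Illustrative inputs: true count y, population n, gamma(a, b) prior
    set.seed(7)
    y <- 7; n <- 2000; a <- 3; b <- 1000

    ## "Plausible" range from the prior predictive (negative binomial)
    lo <- qnbinom(0.005, size = a, prob = b / (b + n))
    hi <- qnbinom(0.995, size = a, prob = b / (b + n))

    ## Draw from the posterior predictive, truncated to [lo, hi],
    ## via the inverse-CDF method
    size_post <- a + y
    prob_post <- (b + n) / (b + 2 * n)
    u <- runif(1, pnbinom(lo - 1, size_post, prob_post),
                  pnbinom(hi,     size_post, prob_post))
    y_syn <- qnbinom(u, size_post, prob_post)
    c(lower = lo, upper = hi, synthetic = y_syn)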

Application of spatial methods for data privacy

Unlike the differentially private methods described above -- which lack any underlying spatial structure -- the methods developed and used below explicitly leverage spatial and other dependence structures. In principle, synthetic data generated from these methods should retain properties of the original data (e.g., its spatial dependence structure) and should be safe to release, but as is often the case, there can be exceptions. Investigating these potential exceptions and proposing solutions is an active area of my research.

  • Quick, H. and Waller, L.A. (2018). “Using spatiotemporal models to generate synthetic data for public use.” Spatial and Spatio-temporal Epidemiology, 27, 37-45. [sste]

    • This paper applies the MSTCAR model of Quick et al. (2017) to heart disease mortality data for the purpose of generating synthetic data for public use. By accounting for spatial, temporal, and between-age sources of dependence, we hope to produce synthetic data that possess greater utility than the left-censored data currently available from CDC WONDER.

  • Quick, H., Holan, S.H., and Wikle, C.K. (2018b). “Generating partially synthetic geocoded public use data with decreased disclosure risk using differential smoothing.” J. Roy. Statist. Soc., Ser. A (Statistics in Society), 181, 649-661. [jrss-a]

      • This paper highlights the risk of disclosure unique to observations with outlying point-referenced geographies (i.e., "spatial outliers"). In essence, typical spatial models tend to overfit spatial outliers (due to the lack of spatial neighbors for the model to "smooth" toward). This paper proposes an approach referred to as differential smoothing, wherein the spatial random effects corresponding to spatial outliers are forced to be conditionally independent of the response variables, eliminating the potential for them to overfit; a conceptual sketch is provided at the end of this section.

      • A walk-through of the R code for the illustrative example is provided on the Code page.

  • Quick, H., Holan, S.H., and Wikle, C.K. (2015). “Zeros and ones: A case for suppressing zeros in sensitive count data with an application to stroke mortality.” Stat, 4, 227-234. [stat]

      • This paper highlights the risk associated with failing to suppress zero counts when suppressing small, non-zero counts (e.g., when the number of cancer cases is between one and five). As a trivial example, consider a region with a population (or where a certain demographic group has a population) of just one. If I tell you that a non-zero number of deaths from a rare event occurred in this region, there must have been exactly one death, since we cannot have a proportion greater than 1. As such, this "interval censoring" approach fails to protect the region with only one event. In this setting, the obvious solution is to simply suppress zeros along with small, non-zero counts.

  • Quick, H., Holan, S.H., Wikle, C.K., and Reiter, J.P. (2015). “Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography.” Spatial Statistics, 14, 439-451. [spasta]

      • This paper discusses a framework for generating fully synthetic public use data which consist of exact point-referenced geographies and one or more attributes per geography (e.g., individuals' addresses and incomes).
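Finally, here is a conceptual sketch (in R) of the differential smoothing idea from Quick et al. (2018b), referenced above. This is not the paper's exact algorithm: the locations, covariance function, outlier threshold, and the "posterior draw" of the random effects are all stand-ins. The key step is redrawing the outliers' spatial random effects from their conditional Gaussian prior given the non-outliers' effects, so that they are conditionally independent of their own responses.

    ## Simulate point-referenced locations, one of them a spatial outlier
    set.seed(2)
    m <- 30
    coords <- cbind(runif(m), runif(m))
    coords[1, ] <- c(5, 5)                 # far from all other points
    D <- as.matrix(dist(coords))
    Sigma <- exp(-D / 0.3)                 # illustrative exponential covariance

    ## Flag spatial outliers: nearest neighbor farther than d_max away
    d_max  <- 0.5
    is_out <- apply(D + diag(Inf, m), 1, min) > d_max

    ## 'w' stands in for one posterior draw of the spatial random effects
    w <- as.vector(t(chol(Sigma)) %*% rnorm(m))

    ## Differential smoothing step: redraw the outliers' effects from the
    ## conditional prior given the non-outliers' effects (not their data)
    S11  <- Sigma[is_out, is_out, drop = FALSE]
    S12  <- Sigma[is_out, !is_out, drop = FALSE]
    S22i <- solve(Sigma[!is_out, !is_out])
    mu_c  <- S12 %*% S22i %*% w[!is_out]
    Sig_c <- S11 - S12 %*% S22i %*% t(S12)
    w[is_out] <- mu_c + t(chol(Sig_c)) %*% rnorm(sum(is_out))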