My short(ish) description: With access to a powerful but untrusted prover, we show that a wide range of distribution properties can be verified with a very small number of conditional sampling queries and sublinear communication (in the support size).
Long Abstract:
We revisit the framework of interactive proofs for distribution testing, first introduced by Chiesa and Gur (ITCS 2018), which has recently experienced a surge in interest, accompanied by notable progress (e.g., Herman and Rothblum, STOC 2022, FOCS 2023; Herman, RANDOM 2024). In this model, a data-poor verifier determines whether a probability distribution has a property of interest by interacting with an all-powerful, data-rich but untrusted prover bent on convincing them that it has the property. While prior work gave sample-, time-, and communication-efficient protocols for testing and estimating a range of distribution properties, these protocols all suffer from an inherent limitation: for most interesting properties of distributions over a domain of size N, the verifier must draw at least Ω(√N) samples of its own. While sublinear in N, this is still prohibitive for the large domains encountered in practice.
In this work, we circumvent this limitation by augmenting the verifier with the ability to perform an exponentially smaller number of more powerful (but reasonable) pairwise conditional queries, effectively enabling them to perform “local comparison checks” of the prover’s claims. We systematically investigate the landscape of interactive proofs in this new setting, giving protocols with polylogarithmic query and sample complexity for (tolerantly) testing all label-invariant properties, thus demonstrating exponential savings without compromising on communication, for this large and fundamental class of testing tasks.
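To make the query model concrete: a pairwise conditional query fixes two domain elements i and j and receives a sample from the unknown distribution conditioned on landing in {i, j}. The Python sketch below simulates such an oracle against a known toy distribution p (a stand-in for illustration only; in the actual model the verifier has no description of the distribution) and uses it for the kind of “local comparison check” mentioned above.

```python
import random

def pair_cond_query(p, i, j, rng=random):
    """Simulate a pairwise conditional sampling query: return i or j, each with
    probability proportional to its mass under p, i.e. a draw from p conditioned
    on the set {i, j}.  Here p is an explicit dict; in the testing model the
    verifier only has oracle access to an unknown distribution."""
    total = p[i] + p[j]
    if total == 0:
        return rng.choice([i, j])       # convention when both elements have zero mass
    return i if rng.random() < p[i] / total else j

# Local comparison check: if the prover claims p(0) and p(2) are (roughly) equal,
# the fraction of queries answering 0 should concentrate around 1/2.
p = {0: 0.5, 1: 0.1, 2: 0.4}
hits = sum(pair_cond_query(p, 0, 2) == 0 for _ in range(2000))
print(hits / 2000)   # ≈ 0.5 / (0.5 + 0.4) ≈ 0.56, enough to refute an "equal" claim
```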
My short(ish) description: We improve the accuracy of privately and continually releasing several counting problems (such as the distinct element count, the triangle count of a graph, and degree histograms) by arguing that a wide range of factorization mechanisms for summing bits can be used under novel sensitivity notions that we define and that are well suited to these upstream tasks.
Long Abstract:
We study differentially private statistics in the fully dynamic continual observation model, where many updates can arrive at each time step and updates to a stream can involve both insertions and deletions of items. Earlier work (e.g., Jain et al., NeurIPS 2023 for counting distinct elements; Raskhodnikova & Steiner, PODS 2025 for triangle counting with edge updates) reduced the respective cardinality estimation problem to continual counting on the difference stream associated with the true function values on the input stream. In such reductions, a single change in the original stream can cause many changes in the difference stream, which poses a challenge when applying private continual counting algorithms to obtain optimal error bounds. We improve the accuracy of several such reductions by studying the associated ℓ2-sensitivity vectors of the resulting difference streams and isolating their properties.
Our improved accuracy stems from applying various factorization mechanisms for the counting matrix. While our primary application is to counting distinct elements, we demonstrate how our framework extends naturally to the estimation of degree histograms and triangle counts (under a slightly relaxed privacy model), thus offering a general approach to continual cardinality estimation in streaming settings. The key technical challenge is arguing that one can use state-of-the-art factorization mechanisms for sensitivity vector sets with the properties we isolate. In particular, we show that for approximate DP, the celebrated square-root factorization of the counting matrix (Henzinger et al., SODA 2023; Fichtenberger et al., ICML 2023) can be employed to give a constant-factor improvement in accuracy over past work based on the binary tree mechanism. We also give a tight analysis of b-ary tree mechanisms with subtraction under these sensitivity patterns, including a polynomial-time routine for exactly computing the ℓp sensitivity, yielding time- and space-efficient algorithms for both pure and approximate DP. Empirically and analytically, we demonstrate that our improved error bounds offer a substantial improvement for cardinality estimation problems over a large range of parameters.
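As a concrete, deliberately simplified illustration of the factorization-mechanism template, here is a Python sketch of continual counting via the square-root factorization of the counting matrix. The noise scale sigma is left as a free parameter: calibrating it to the sensitivity of the transformed stream is exactly the issue the paper studies, and nothing here reproduces the paper's analysis.

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrt_factorization_counting(x, sigma, rng=None):
    """Sketch of a factorization mechanism for continual counting.

    The counting matrix A (lower-triangular all-ones) maps a stream x of
    per-step increments to its prefix sums.  Factor A = L @ R, perturb R @ x
    with Gaussian noise, and release L @ (R @ x + z).  Here L = R = A^{1/2}
    (the square-root factorization); since A^{1/2} is lower triangular, the
    t-th output depends only on the first t inputs and can be released online.
    The scale sigma must be calibrated to the sensitivity of R @ x, which is
    the quantity analyzed in the paper and not reproduced here.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(x)
    A = np.tril(np.ones((T, T)))
    R = np.real(sqrtm(A))                     # A = R @ R, R lower triangular
    z = rng.normal(0.0, sigma, size=T)
    return R @ (R @ x + z)                    # noisy prefix sums, one per step

# Toy difference stream of -1/0/+1 increments.
x = np.array([1, 1, 0, -1, 1, 0, 0, 1], dtype=float)
print(np.round(sqrt_factorization_counting(x, sigma=0.5), 2))
print(np.cumsum(x))                           # exact prefix sums, for comparison
```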
My short(ish) description: Existing definitions of privacy for large data releases do not directly tie to realistic privacy harms that could be caused by the release, and as a result are often criticized for being overly conservative/stringent at significant cost to utility. Motivated by this, we give a new necessary condition for privacy (called demographic coherence) that is directly inspired by privacy attacks and grounds itself in a realistic adversarial model. We show that our definition is provably weaker than differential privacy, and corresponds to natural experimental setups for building intuition about privacy risks.
Long Abstract:
The technical literature about data privacy largely consists of two complementary approaches: formal definitions of conditions sufficient for privacy preservation and attacks that demonstrate privacy breaches. Differential privacy is an accepted standard in the former sphere. However, differential privacy's powerful adversarial model and worst-case guarantees may make it too stringent in some situations, especially when achieving it comes at a significant cost to data utility. Meanwhile, privacy attacks aim to expose real and worrying privacy risks associated with existing data release processes but often face criticism for being unrealistic. Moreover, the literature on attacks generally does not identify what properties are necessary to defend against them.
We address the gap between these approaches by introducing demographic coherence, a condition inspired by privacy attacks that we argue is necessary for data privacy. This condition captures privacy violations arising from inferences about individuals that are incoherent with respect to the demographic patterns in the data. Our framework focuses on confidence-rated predictors, which can in turn be distilled from almost any data-informed process. Thus, we capture privacy threats that exist even when no attack is explicitly being carried out. Our framework not only provides a condition with respect to which data release algorithms can be analysed, but also suggests natural experimental evaluation methodologies that could be used to build practical intuition and make tangible assessments of risk. Finally, we argue that demographic coherence is weaker than differential privacy: we prove that every differentially private data release is also demographically coherent, and that there are demographically coherent algorithms which are not differentially private.
My short(ish) description: Estimating the density of a distribution in the Wasserstein distance is a fundamental problem, and worst-case notions of optimality for this problem fail to accurately capture the performance of estimation algorithms, since they don't distinguish between algorithms that perform badly on all distributions and algorithms that perform badly only on a single distribution. We suggest definitions of instance-optimality for this problem that benchmark the adaptivity of the algorithm to the 'hardness' of the distribution, and characterize the instance-optimal rates in terms of distribution-dependent parameters.
Long Abstract:
Estimating the density of a distribution from samples is a fundamental problem in statistics. In many practical settings, the Wasserstein distance is an appropriate error metric for density estimation. For example, when estimating population densities in a geographic region, a small Wasserstein distance means that the estimate is able to capture roughly where the population mass is. In this work we study differentially private density estimation in the Wasserstein distance. We design and analyze instance-optimal algorithms for this problem that can adapt to easy instances.
For distributions P over ℝ, we consider a strong notion of instance-optimality: an algorithm that uniformly achieves the instance-optimal estimation rate is competitive with an algorithm that is told that the distribution is either P or Q for some distribution Q whose probability density function (pdf) is within a factor of 2 of the pdf of P. For distributions over ℝ^2, we use a different notion of instance optimality. We say that an algorithm is instance-optimal if it is competitive with an algorithm that is given a constant-factor multiplicative approximation of the density of the distribution. We characterize the instance-optimal estimation rates in both these settings and show that they are uniformly achievable (up to polylogarithmic factors). Our approach for ℝ^2 extends to arbitrary metric spaces as it goes via hierarchically separated trees. As a special case, our results lead to instance-optimal private learning in TV distance for discrete distributions.
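For readers less familiar with the metric, the 1-Wasserstein (earth mover's) distance used here has the standard coupling definition, with a closed form on the real line in terms of CDFs:

```latex
W_1(P, Q) \;=\; \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(X, Y) \sim \gamma}\!\left[ d(X, Y) \right]
\;=\; \int_{\mathbb{R}} \left| F_P(t) - F_Q(t) \right| dt
\quad \text{(the second equality holding for } P, Q \text{ on } \mathbb{R}\text{)}.
```

Here Γ(P, Q) denotes the set of couplings of P and Q, d is the underlying metric, and F_P, F_Q are the cumulative distribution functions.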
My short(ish) description: We study differentially private continual release of distinct element counts in streams where elements can be both inserted and deleted (i.e., turnstile streams). We show strong polynomial lower bounds on the error of any mechanism for this problem in terms of the stream length T, thereby proving that this setting is much harder than the insertion-only case. We then identify a natural parameter of input turnstile streams that is low for many natural streams. We give tight error guarantees based on this parameter, bypassing the hardness of the turnstile model in many cases of practical interest.
Long Abstract:
Privacy is a central challenge for systems that learn from sensitive data sets, especially when a system's outputs must be continuously updated to reflect changing data.
We consider the achievable error for the differentially private continual release of a basic statistic---the number of distinct items---in a stream where items may be both inserted and deleted (the turnstile model). With only insertions, existing algorithms have additive error just polylogarithmic in the length of the stream T.
We uncover a much richer landscape in the turnstile model, even without considering memory restrictions. We show that any differentially private mechanism that handles insertions and deletions has a worst-case additive error at least T^{1/4} even under a relatively weak, event-level privacy definition.
Then, we identify a property of the input stream, its maximum flippancy, that is low for natural data streams and for which one can give tight parameterized error guarantees. Specifically, the maximum flippancy is the largest number of times the count of a single item changes from a positive number to zero over the course of the stream. We present an item-level differentially private mechanism that, for all turnstile streams with maximum flippancy w, continually outputs the number of distinct elements with O(√w · polylog T) additive error, without requiring prior knowledge of w.
This is the best achievable error bound that depends only on w, for a large range of values of w. When w is small, our mechanism provides error guarantees similar to the polylogarithmic-in-T guarantees of the insertion-only setting, bypassing the hardness of the turnstile model.
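Read operationally, the maximum flippancy can be computed in one pass over the stream. The sketch below follows the informal definition given above (counting positive-to-zero transitions of each item's running count); the paper's formal definition may differ in minor details.

```python
from collections import defaultdict

def max_flippancy(stream):
    """One-pass computation of the maximum flippancy of a turnstile stream.

    `stream` is a sequence of (item, delta) pairs with delta = +1 (insertion)
    or -1 (deletion).  Per the informal definition above, an item's flippancy
    is the number of times its running count drops from a positive value to
    zero; the maximum is taken over all items.
    """
    count = defaultdict(int)
    flips = defaultdict(int)
    for item, delta in stream:
        before = count[item]
        count[item] += delta
        if before > 0 and count[item] == 0:   # count of `item` just hit zero
            flips[item] += 1
    return max(flips.values(), default=0)

# Item 'a' is inserted and fully deleted twice (flippancy 2); 'b' never hits zero.
stream = [('a', +1), ('b', +1), ('a', -1), ('a', +1), ('a', -1)]
print(max_flippancy(stream))   # 2
```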
My short description: Replicability, defined in recent work, is a strong notion of algorithmic stability that facilitates the simple and efficient verification of conclusions in published empirical studies. In this work, we show that replicability is statistically equivalent to other notions of algorithmic stability (differential privacy, perfect generalization) that, at first glance, seem unrelated. We use these connections to solve open problems concerning all these stability notions. We also have cool results on the tightness of these connections (both computationally and statistically).
Long Abstract:
The notion of replicable algorithms was introduced in [Impagliazzo, Lei, Pitassi, Sorrell '22] to describe randomized algorithms that are stable under the resampling of their inputs. More precisely, a replicable algorithm gives the same output with high probability when its randomness is fixed and it is run on a new i.i.d. sample drawn from the same distribution. Using replicable algorithms for data analysis can facilitate the verification of published results by ensuring that the results of an analysis will be the same with high probability, even when that analysis is performed on a new data set.
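The definition can be read operationally, as in the toy probe below: run the analysis twice, on two fresh samples but with the same internal randomness, and check whether the outputs coincide. The names looks_replicable and rounded_mean are illustrative placeholders, not objects from the paper, and this is an empirical sanity check rather than the formal definition.

```python
import random
import numpy as np

def looks_replicable(algo, sample_dist, n, trials=200, seed=0):
    """Toy empirical probe of replicability (not the paper's formal definition):
    draw two independent n-element samples from the same distribution, run
    `algo` on each with the SAME internal randomness, and report how often the
    two outputs coincide."""
    agree = 0
    for t in range(trials):
        s1 = [sample_dist() for _ in range(n)]
        s2 = [sample_dist() for _ in range(n)]
        # Fixing the seed shares the algorithm's internal coins across the two runs.
        if algo(s1, random.Random(seed + t)) == algo(s2, random.Random(seed + t)):
            agree += 1
    return agree / trials

def rounded_mean(sample, coins, grid=0.5):
    """Mean estimator rounded to a randomly shifted grid: the random shift is the
    algorithm's internal randomness, and coarse rounding absorbs the sampling
    noise, which is what keeps the output stable across resampled data."""
    offset = coins.uniform(0, grid)
    return round((np.mean(sample) - offset) / grid) * grid + offset

print(looks_replicable(rounded_mean, lambda: np.random.normal(0.0, 1.0), n=2000))
```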
In this work, we establish new connections and separations between replicability and standard notions of algorithmic stability. In particular, we give sample-efficient algorithmic reductions between perfect generalization, approximate differential privacy, and replicability for a broad class of statistical problems. Conversely, we show any such equivalence must break down computationally: there exist statistical problems that are easy under differential privacy, but that cannot be solved replicably without breaking public-key cryptography. Furthermore, these results are tight: our reductions are statistically optimal, and we show that any computational separation between DP and replicability must imply the existence of one-way functions.
Our statistical reductions give a new algorithmic framework for translating between notions of stability, which we instantiate to answer several open questions in replicability and privacy. This includes giving sample-efficient replicable algorithms for various PAC learning, distribution estimation, and distribution testing problems, algorithmic amplification of δ in approximate DP, conversions from item-level to user-level privacy, and the existence of private agnostic-to-realizable learning reductions under structured distributions.
My short description: This paper studies differential privacy in an (in my opinion) understudied streaming setting. It gives a general technique for lower bounds in this setting and uses it to show that there is no equivalent of the ‘exponential mechanism’ in this setting that achieves error logarithmic in both the number of candidates d and the number of time steps T. It also introduces a model of privacy that accounts for adaptive selection of inputs and shows that known algorithms satisfy this stronger notion of privacy.
Long Abstract: We study the accuracy of differentially private mechanisms in the continual release model. A continual release mechanism receives a sensitive dataset as a stream of T inputs and produces, after receiving each input, an output that is accurate for all the inputs received so far. In contrast, a batch algorithm receives the data as one batch and produces a single output.
We provide the first strong lower bounds on the error of continual release mechanisms. In particular, for two fundamental problems that are widely studied and used in the batch model, we show that the worst-case error of every continual release algorithm is T^{1/3} times larger than that of the best batch algorithm. Previous work shows only a poly-logarithmic (in T) gap between the worst-case error achievable in these two models; further, for many problems, including the summation of binary attributes, the poly-logarithmic gap is tight (Dwork et al., 2010; Chan et al., 2010). Our results show that problems closely related to summation---specifically, those that require selecting the largest of a set of sums---are fundamentally harder in the continual release model than in the batch model. Our lower bounds assume only that privacy holds for streams fixed in advance (the “nonadaptive” setting).
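For contrast with the selection problems above, here is a minimal sketch of the polylogarithmic-error baseline for binary summation in the continual release model, in the style of the binary tree mechanism of Dwork et al. (2010) and Chan et al. (2010). Function names and the (textbook, event-level) noise calibration are illustrative choices, not taken from the paper.

```python
import numpy as np

def binary_tree_counter(bits, epsilon, rng=None):
    """Sketch of the binary tree mechanism for continual counting of a 0/1 stream.

    A noisy partial sum is stored for every dyadic interval as soon as it is
    complete; the prefix count at time t is the sum of the noisy values of the
    O(log T) dyadic intervals in t's binary decomposition.  Each stream element
    lands in one interval per level, so Laplace noise of scale (#levels)/epsilon
    per node gives event-level epsilon-DP.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(bits)
    levels = int(np.ceil(np.log2(T))) + 1 if T > 1 else 1
    scale = levels / epsilon
    noisy = {}                                  # (level, block index) -> noisy dyadic sum
    outputs = []
    for t in range(1, T + 1):
        for lvl in range(levels):               # close every dyadic block ending at t
            width = 1 << lvl
            if t % width == 0:
                noisy[(lvl, t // width)] = sum(bits[t - width:t]) + rng.laplace(0.0, scale)
        est, pos = 0.0, 0                       # decompose [1, t] into completed blocks
        for lvl in reversed(range(levels)):
            width = 1 << lvl
            if t & width:
                pos += width
                est += noisy[(lvl, pos // width)]
        outputs.append(est)
    return outputs

stream = [1, 0, 1, 1, 1, 0, 0, 1]
print(np.round(binary_tree_counter(stream, epsilon=1.0), 1))
print(list(np.cumsum(stream)))                  # exact prefix counts, for comparison
```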
We also formulate a model that allows for adaptively selected inputs and prove the privacy of several known algorithms in that model. In particular, we show that our lower bounds are matched by the error of simple algorithms whose privacy holds even for adaptively selected streams. The adaptive model captures dependencies that arise in many applications of continual release.
My short description: We formalize and investigate the basic problem of differentially private sampling. Given n samples from a distribution P, a private sampling algorithm produces a constant number of samples from a distribution close in total variation distance to P while maintaining differential privacy w.r.t. its input. We show surprising relations between the number of samples needed for this problem and for the related problem of private distribution estimation. To prove our results, we introduce new techniques for both upper and lower bounds.
Long Abstract: We initiate an investigation of private sampling from distributions. Given a dataset with n independent observations from an unknown distribution P, a sampling algorithm must output a single observation from a distribution that is close in total variation distance to P while satisfying differential privacy. Sampling abstracts the goal of generating small amounts of realistic-looking data. We provide tight upper and lower bounds for the dataset size needed for this task for three natural families of distributions: arbitrary distributions on {1,...,k}, arbitrary product distributions on {0,1}^d, and product distributions on {0,1}^d with bias in each coordinate bounded away from 0 and 1. We demonstrate that, in some parameter regimes, private sampling requires asymptotically fewer observations than learning a description of P non-privately; in other regimes, however, private sampling proves to be as difficult as private learning. Notably, for some classes of distributions, the overhead in the number of observations needed for private learning compared to non-private learning is completely captured by the number of observations needed for private sampling.
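To make the task concrete, here is a simple baseline for the finite-domain case. It is explicitly not the paper's algorithm and makes no claim of sample-optimality: it privatizes the empirical histogram with Laplace noise, renormalizes, and draws one sample, with privacy of the released sample following by post-processing.

```python
import numpy as np

def dp_sample_discrete(data, k, epsilon, rng=None):
    """Naive baseline for private sampling from a distribution on {1,...,k}
    (an illustration of the task only, not the paper's algorithm): perturb the
    empirical histogram with Laplace noise, renormalize it into a probability
    vector, and draw one sample from it.  Add/remove-one neighbors change a
    single bin by 1, so Laplace scale 1/epsilon gives epsilon-DP, and the
    released sample inherits privacy by post-processing.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(np.asarray(data) - 1, minlength=k).astype(float)
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=k)
    probs = np.clip(noisy, 0.0, None)
    probs = np.ones(k) / k if probs.sum() == 0 else probs / probs.sum()
    return int(rng.choice(np.arange(1, k + 1), p=probs))

# Toy usage: n = 500 observations from a skewed distribution on {1,...,5}.
rng = np.random.default_rng(0)
data = rng.choice(np.arange(1, 6), size=500, p=[0.5, 0.2, 0.15, 0.1, 0.05])
print(dp_sample_discrete(data, k=5, epsilon=1.0, rng=rng))
```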
My short description: We give a general reduction from multiclass to binary differentially private PAC learning. We use this technique to obtain sample complexity bounds for multiclass private PAC learning that are exponentially better than the best-known prior to our work.
Long Abstract: We show a generic reduction from multiclass differentially private PAC learning to binary private PAC learning. We apply this transformation to a recently proposed binary private PAC learner to obtain a private multiclass learner with sample complexity that has a polynomial dependence on the multiclass Littlestone dimension and a poly-logarithmic dependence on the number of classes. This yields an exponential improvement in the dependence on both parameters over learners from previous work. Our proof extends the notion of Ψ-dimension defined in the work of Ben-David et al. [JCSS '95] to the online setting and explores its general properties.