We propose new inference procedures robust to general forms of weak dependence in the sense that they are asymptotically valid if the target parameter can be consistently estimated at the parametric rate. The procedures compare a test statistic constructed using resampled data against a critical value constructed either through resampling or a normal approximation. Computation is simple and does not depend on the correlation structure of the data. We consider applications to settings with unknown or complicated forms of dependence, with network dependence as a leading example.
This paper analyzes a semiparametric model of network formation in the presence of multiple, unobserved, and agent-specific fixed effects. Given agents' observed attributes, neither the conditional distributions of these effects nor those of the disturbance terms associated with each linking decision are parametrically specified. I give sufficient conditions for point identification of the coefficients on the observed covariates. This result relies on the existence of at least one continuous covariate with unbounded support. I provide partial identification results when all covariates have bounded support. Specifically, I derive bounds for each component of the vector of parameters when all the covariates have discrete support. I propose a semiparametric estimator for the vector of coefficients that is consistent and asymptotically normal as the number of individuals in the network increases. Monte Carlo experiments demonstrate that the estimator performs well in finite samples. Finally, in an empirical study, I analyze the determinants of a friendship network using the Add Health dataset.
I study a regression model in which one covariate is an unknown function of a latent driver of link formation in a network. Rather than specify and fit a parametric network formation model, I introduce a new method based on matching pairs of agents with similar columns of the squared adjacency matrix, the ijth entry of which contains the number of other agents linked to both agents i and j. The intuition behind this approach is that for a large class of network formation models the columns of this matrix characterize all of the identifiable information about individual linking behavior. In the paper, I first describe the model and formalize this intuition. I then introduce estimators for the parameters of the regression model and characterize their large sample properties.
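As a rough illustration of the matching idea in the abstract above, the following sketch computes the squared adjacency matrix and pairs each agent with the agent whose column is closest. The Euclidean distance, the scaling by the number of agents, and the function names `codegree_matrix`, `column_distances`, and `nearest_match` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def codegree_matrix(adj):
    """Square a symmetric 0/1 adjacency matrix.

    Entry (i, j) of the result counts the agents linked to both i and j.
    """
    adj = np.asarray(adj, dtype=int)
    return adj @ adj

def column_distances(adj):
    """Pairwise distance between columns of the squared adjacency matrix.

    Assumption: Euclidean distance divided by the number of agents.
    """
    sq = codegree_matrix(adj)
    n = sq.shape[0]
    # diff[k, i, j] = sq[k, i] - sq[k, j]
    diff = sq[:, :, None] - sq[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0)) / n

def nearest_match(adj):
    """For each agent, return the index of the other agent with the closest column."""
    dist = column_distances(adj).astype(float)
    np.fill_diagonal(dist, np.inf)  # an agent cannot be matched to itself
    return dist.argmin(axis=1)

# Hypothetical 4-agent network: agents 0 and 1 are each linked to agents 2 and 3.
example_adj = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
])
example_matches = nearest_match(example_adj)  # array([1, 0, 3, 2])
```

In this toy network, agents 0 and 1 share the same two neighbors, so their columns of the squared matrix coincide and they are matched to each other; the same holds for agents 2 and 3.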
The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [16, 13, 23, 19], but, as pointed out in [17], causal inference in non-iid contexts is challenging due to the combination of unobserved confounding bias and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We show that, under certain conditions, it is possible to identify counterfactual distributions in causal models that allow both dependence and unobserved confounding. We use segregated graphs [18], a generalization of latent projection mixed graphs [24], to represent causal models of this type and provide a complete algorithm for non-parametric identification in these models. We then demonstrate how statistical inferences may be performed on causal parameters identified by our algorithm, even in cases where parts of the model exhibit full interference, meaning only a single sample is available for parts of the model [21].
Causal inference in network analysis can be complicated by interdependencies in the data. But we argue that fundamental principles of causal inference apply nonetheless. In particular, careful attention to model specification, especially using longitudinal data, can mitigate bias more than the choice of estimation technique can. We attend particularly to estimation of the influence model, which has received less attention in the statistical literature than the selection model. We present a simulation and an algebraic proof that ordinary least squares regression can yield unbiased estimates of influence models that carefully leverage longitudinal data. Recognizing that specification cannot remove all bias, we encourage the use of sensitivity analyses to quantify how much of an estimate must be due to bias to invalidate an inference.
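To make the idea of a longitudinal influence model concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration only: the specification (the current outcome regressed on the agent's own lagged outcome and the mean lagged outcome of network neighbors), the random network, and the parameter values `rho_true` and `beta_true`. It is not the simulation design of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, link_prob = 1000, 0.005
rho_true, beta_true = 0.5, 0.3  # hypothetical own-lag and peer-influence effects

# Random undirected network with no self-links (an assumed design).
upper = np.triu(rng.random((n, n)) < link_prob, k=1)
adj = (upper | upper.T).astype(float)

# Row-normalize so that W @ y is the mean outcome among an agent's neighbors;
# agents with no neighbors get a peer mean of zero.
degree = adj.sum(axis=1, keepdims=True)
W = np.divide(adj, degree, out=np.zeros_like(adj), where=degree > 0)

# Two waves of data: lagged outcomes, then outcomes that depend on the lags.
y_prev = rng.normal(size=n)
peer_prev = W @ y_prev
y_next = rho_true * y_prev + beta_true * peer_prev + rng.normal(scale=0.5, size=n)

# Ordinary least squares of the current outcome on an intercept and the lagged terms.
X = np.column_stack([np.ones(n), y_prev, peer_prev])
coef, *_ = np.linalg.lstsq(X, y_next, rcond=None)
intercept_hat, rho_hat, beta_hat = coef
```

Because the regressors here are measured in the earlier wave and the error term is drawn independently of them, the least squares estimates should land close to the values used to generate the data. The sketch says nothing about designs in which those conditions fail.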
In most real-world systems, units are interconnected and can be represented as networks consisting of nodes and edges. For instance, in social systems individuals can have social, family, or financial ties. In settings where some units are exposed to a treatment and its effects spill over to connected units, estimating both the direct effect of the treatment and the spillover effects presents several challenges. First, assumptions on the way and the extent to which spillover effects occur along the observed network are required. Second, in observational studies, where the treatment assignment is not under the control of the investigator, confounding and homophily are potential threats to the identification and estimation of causal effects on networks. Here, we make two structural assumptions: i) neighborhood interference, which assumes that interference operates only through a function of the immediate neighbors' treatments, and ii) unconfoundedness of the individual and neighborhood treatment, which rules out the presence of unmeasured confounding variables, including those driving homophily. Under these assumptions we develop a new covariate-adjustment estimator for treatment and spillover effects in observational studies on networks. Estimation is based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors' treatment. Adjustment for the propensity score is performed using a penalized spline regression. Inference relies on a three-step Bayesian procedure that takes into account the uncertainty in the propensity score estimation and avoids model feedback. Finally, correlation of interacting units is taken into account using a community detection algorithm and incorporating random effects in the outcome model.
All these sources of variability, including variability of treatment assignment, are accounted for in the posterior distribution of finite-sample causal estimands. We conduct a simulation study in which we assess the performance of our estimator on different types of networks, generated from a stochastic block model and a latent space model, or taken from the friendship network of the Add Health study.
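The neighborhood interference assumption in the abstract above says that spillovers act only through a function of the immediate neighbors' treatments. One common choice of such a function, assumed here purely for illustration, is the fraction of a unit's neighbors that are treated. The function name `neighborhood_exposure` and the convention of assigning zero exposure to isolated units are also assumptions of this sketch.

```python
import numpy as np

def neighborhood_exposure(adj, treatment):
    """Fraction of each unit's immediate neighbors that are treated.

    adj: symmetric 0/1 adjacency matrix with a zero diagonal.
    treatment: 0/1 vector of individual treatments.
    Units with no neighbors receive exposure 0 (an assumed convention).
    """
    adj = np.asarray(adj, dtype=float)
    z = np.asarray(treatment, dtype=float)
    degree = adj.sum(axis=1)
    treated_neighbors = adj @ z
    return np.divide(
        treated_neighbors,
        degree,
        out=np.zeros_like(degree),
        where=degree > 0,
    )

# Hypothetical network: a path 0 - 1 - 2 plus an isolated unit 3; only unit 0 is treated.
example_adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
example_treatment = np.array([1, 0, 0, 0])
example_exposure = neighborhood_exposure(example_adj, example_treatment)
# array([0. , 0.5, 0. , 0. ])
```

Under this choice, each unit is described by a pair of values, its own treatment and its neighborhood exposure, which is the kind of joint treatment that a generalized propensity score would need to balance.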
Clustered randomized trials (CRTs) are popular in the social sciences for evaluating the efficacy of a new policy or program by randomly assigning one set of clusters to the new policy and the other to the usual policy. Often, many individuals within a cluster fail to take advantage of the new policy, resulting in noncompliance. Also, individuals within a cluster may influence each other; for instance, treatment exposure among those who comply with treatment may affect outcomes for those who refuse treatment. Here, we study the identification of causal effects in CRTs with both noncompliance and treatment spillovers. We prove that the standard causal estimand, the complier average treatment effect, is not what the standard analysis estimates. We also show that there does not exist an unbiased estimator for the relevant causal estimands; instead, we provide bounds. We demonstrate our results with an analysis of data from a deworming intervention in Kenya. We find that, given high levels of compliance, we are able to place informative bounds on the total effect among the compliers. The results indicate that the intervention effectively reduced infections among those who complied with their treatment status.