Distinct methodological cultures arise from different communities working on distinct use-cases. Because the uses of data analysis are many, the subcultures are many. Indeed, there is a long history of subject-matter areas independently developing statistical methodology: biometrics, psychometrics, econometrics, and now data science, among others. In his famous essay "Statistical Modeling: The Two Cultures," Leo Breiman considered the two data-analysis cultures he was most familiar with: classical statisticians and algorithm-first proto-data-scientists. But a broader look is also instructive, because there are in fact far more than two cultures these days, and my claim is that these cultures are best understood not in terms of the methods they use, but in terms of the goals they develop those methods toward. Understand the demands of a particular use-case and you will understand why certain methods are favored in a given culture. With this in mind, let's consider four dimensions along which data-analytic cultures can vary: the estimands they consider, the manner in which their data are acquired, which modeling assumptions they embrace, and their treatment of statistical uncertainty.
Estimating averages conditional on observed covariates (what is now called “supervised learning”) calls for different methods than estimating a population (marginal) average. Estimating quantiles or distributions requires different methods than estimating means.
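The contrast can be made concrete with a toy simulation (the data-generating process is my own illustration): a marginal average is a single number that needs no model, while a conditional average requires fitting a regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 10, n)                  # observed covariate
y = 2.0 * x + rng.normal(0.0, 1.0, n)      # outcome

# Marginal (population) average: one number, no model needed
marginal_mean = y.mean()                   # ≈ 2 * E[x] = 10

# Conditional average E[y | x]: requires a regression model
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
cond_mean_at_3 = beta[0] + beta[1] * 3.0   # ≈ 6
```

The marginal mean throws away the covariate entirely; the conditional mean is a different estimand at every value of x, which is why "supervised learning" needs a far richer toolkit.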
One major dichotomy among estimands is whether they are purely predictive or causal: do we want to predict someone's salary based on where they went to college, or do we want to predict what someone's salary would be if everything else about them remained the same but they were made to attend a certain college? It could be argued that this distinction animates much of the discussion in Breiman's essay, albeit implicitly. Looking back, the "models vs. algorithms" framing of Breiman's essay obscures the prediction-versus-causal-prediction distinction, because both models and algorithms can be used to estimate causal quantities. (Causal inference was almost exclusively model-based at the time the essay was written, but this was mainly just because the algorithmic approach was so new.)
Another major difference between cultures is how their data are obtained. For example, econometricians have historically had very little use for design of experiments because their data are usually passively observed (e.g. economic indicators or outcomes of non-randomized social programs). Conversely, biostatisticians' focus on estimating average treatment effects from randomized controlled trials explains their interest in power calculations rather than methods for dealing with omitted-confounder bias. Missing or censored data are a further example of a complication arising from the data-acquisition process, one that dominates methodological discussion in some fields (e.g. polling and survival analysis) but is effectively absent in others (e.g. quality control).
Different cultures also vary substantially in terms of which modeling assumptions are standard and which are taboo. Some of these differences are intrinsic to the use-case. For example, astronomical data from optical telescopes or measurements from physical chemistry might naturally be analyzed using a theoretically motivated deterministic physical model along with a measurement-error component for which independent, homoskedastic Gaussian errors are well-justified. By contrast, a recommender system for suggesting products to online consumers doesn't admit any plausible theoretical model, and the patterns of variability are substantially more complex. At the other end of the spectrum, agent-based models and differential-equation-based models make extremely strong assumptions about the inter-relationships between observations over time.
To complicate the matter even more, an assumption does not need to be correct to be useful (or at least, not actively harmful). George Box's oft-quoted slogan "all models are wrong, but some are useful" conveys this essential fact, but its pithiness obscures an important nuance: the same model can be useful for one task and useless for another! For example, linear regression is a suitable method even if the true conditional expectation isn't linear, provided that the estimand of interest is the average partial effect of a particular variable (and assuming other conditions are satisfied). This is more or less the vantage point championed by Joshua Angrist and Jörn-Steffen Pischke in their popular econometrics textbook Mastering 'Metrics. On the other hand, if the goal is out-of-sample prediction, then a linear model would be inferior to a more flexible one if the conditional expectation is, in fact, nonlinear. Thus, modeling assumptions within a culture are driven by matters of use-specific adequacy as well as matters of subject-specific fact.
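A small simulated sketch of the point (the data-generating process is my own illustration): with a Gaussian covariate, the OLS slope recovers the average derivative of a nonlinear conditional mean (a consequence of Stein's lemma), even though the linear model is flatly "wrong" as a description of E[y | x].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(1.0, 1.0, n)             # Gaussian covariate, mean 1
y = x**2 + rng.normal(0.0, 1.0, n)      # truth is nonlinear: E[y|x] = x^2

# OLS slope of y on x -- the linear model is misspecified
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Average partial effect of x: E[d/dx x^2] = E[2x] = 2 here
ape = (2 * x).mean()
# slope ≈ ape ≈ 2, despite the "wrong" model
```

For prediction, of course, the same linear fit would be badly beaten by anything that can capture the quadratic shape.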
Cultures also differ in whether, how, and how much they account for estimation uncertainty. As with our other criteria, these differences are driven by the subject area. For instance, the stringency of statistical standards varies dramatically from field to field. While a 5% level has long been the conventional standard in the medical literature, particle physicists demand a much higher level of evidence, requiring orders-of-magnitude smaller p-values before rejecting a null hypothesis. Largely, this disparity results from whether or not a statistical analysis is intended to be the final word on a scientific or policy question. In medicine, perhaps, and certainly in the social sciences, a "statistically significant" designation primarily flags a finding as worthy of follow-up study. In particle physics, it means something closer to "we can finally stop running all of these resource-intensive experiments and put our discovery into the textbooks". To its credit, psychometrics studiously distinguishes between "exploratory" and "confirmatory" analysis, with the understanding that neither alone is sufficient to do good science. The replicability crisis in the social sciences can, in some sense, be put down to the fact that "statistically significant" was never supposed to mean "known for sure".
In a somewhat different vein, meteorological and epidemiological models based on differential equations are deterministic but depend on unknown inputs. Such models are "calibrated" rather than "estimated", and the approach to uncertainty quantification is much closer to a sensitivity analysis than to traditional inference. Here, the impact of model misspecification is likely far greater than any classical sampling uncertainty, but debates about the model can be investigated, to some extent anyway, by other forms of scientific inquiry based on first principles. Moreover, it is rarely clear that the parameter values would be uniquely identified by the available data even with arbitrarily large samples. But if the goal is short-term weather forecasts, such concerns are largely academic, and some form of back-testing provides satisfactory data for estimating forecast accuracy. Model invalidity and non-identification are substantially more worrisome if one wants to make climate policy recommendations on the basis of such models.
Similarly, algorithmic prediction models can now be furnished with rigorous prediction intervals via conformal inference methods, but parameter estimation in such models is challenged by lack of identification. Likewise, uncertainty estimates arising from, say, a statistical learning analysis (VC dimension) typically provide impractically large intervals (unless the hypothesis class is severely restricted). Bayesian uncertainty represents yet another approach to uncertainty quantification that is increasingly popular but still, many decades on, not yet mainstream. Here is not the place to dive into that long-standing debate, but it is interesting to note that Bayesian methods have mainly been adopted for problems for which the traditional paradigm simply wasn't able to provide reasonable answers in a practicable way. For example, Alan Gelfand -- popularizer of the Gibbs sampler -- once told me that his work on applied spatial statistics led him to a Bayesian approach, because no other methods could give answers to the questions of interest.
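To illustrate the conformal idea, here is a minimal split-conformal sketch (simulated data, with a deliberately misspecified linear point predictor): the marginal coverage guarantee holds no matter how wrong the underlying model is, which is exactly what makes the approach attractive for black-box prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data; the point predictor below is intentionally misspecified
x = rng.uniform(-2, 2, 3000)
y = np.sin(x) + rng.normal(0, 0.3, 3000)
x_tr, y_tr = x[:1000], y[:1000]
x_cal, y_cal = x[1000:2000], y[1000:2000]
x_te, y_te = x[2000:], y[2000:]

# 1. Fit any point predictor on the training split (here, a linear OLS fit)
A = np.column_stack([np.ones_like(x_tr), x_tr])
b0, b1 = np.linalg.lstsq(A, y_tr, rcond=None)[0]
predict = lambda x: b0 + b1 * x

# 2. Calibrate: take an order statistic of absolute residuals on a held-out split
alpha = 0.1
scores = np.sort(np.abs(y_cal - predict(x_cal)))
k = int(np.ceil((len(scores) + 1) * (1 - alpha))) - 1
q = scores[k]

# 3. Intervals predict(x) ± q cover y with probability >= 1 - alpha, marginally
covered = np.mean((y_te >= predict(x_te) - q) & (y_te <= predict(x_te) + q))
```

The empirical coverage lands near 90% despite the linear fit being the wrong model for a sine curve; what the method cannot do is tell you anything about the "parameters" of that wrong model.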
Focusing on the use-case highlights that there is no one "right way" to do things. Of course, such "methodological ecumenicalism" does not mean that just anything goes. Rather, it suggests that poor methodology arises mainly when one loses sight of the intended application. Sometimes this happens for mundane reasons, such as habit. For example, medical researchers often apply linear logistic regression reflexively and uncritically to estimate propensity scores. With flexible regression models widely available, banking one's results on the validity of a linear model is as unnecessary as it is commonplace.
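A small simulated sketch of this point (the nonlinear propensity function is my own illustration, not drawn from any real study): when the true propensity is nonlinear in the covariates, an off-the-shelf flexible classifier recovers it better than a linear-logistic fit.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=(n, 2))
# True propensity is nonlinear (quadratic) in the first covariate
p_true = 1 / (1 + np.exp(-(x[:, 0] ** 2 - 1 + 0.5 * x[:, 1])))
t = rng.binomial(1, p_true)          # treatment assignment

# Reflexive choice: linear logistic regression
linear = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
# Flexible alternative, equally off-the-shelf
flex = GradientBoostingClassifier().fit(x, t).predict_proba(x)[:, 1]

mse_linear = np.mean((linear - p_true) ** 2)
mse_flex = np.mean((flex - p_true) ** 2)
# mse_flex < mse_linear: the flexible model tracks the true propensity better
```

The flexible fit here costs one changed import, which is the sense in which the linear habit is unnecessary rather than merely suboptimal.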
However, sometimes particular methods are favored on the basis of false virtues, meaning that the properties endorsing the method are unactualized in practice. Examining such situations can be instructive, as it fosters an understanding of why methods are sometimes applied inappropriately, rather than laying blame on a misguided “culture”. Below we consider a (partial) list of common false virtues and give examples of popular methods that boast them.
By “ineffectual”, I mean theoretical results whose conclusions are practically irrelevant, which can happen for a variety of reasons.
Unrealistic assumptions about the data. To take a high-profile example of unrealistic assumptions, consider the LASSO method for performing variable selection in linear regression models. The LASSO paper is one of the most highly cited applied-mathematics papers of all time, and tracing the intellectual history of its theory provides some illuminating context for evaluating the LASSO's merit. The sparse-recovery guarantees that made the LASSO famous have their roots in compressed sensing, which was effectively a result in the design of experiments: it showed that to perfectly reconstruct an image, one needed only a logarithmic number of measurements, provided that the original signal was sparse in a certain sense and the measurements were (essentially) taken at random. In that application, the first condition was an explicit but plausible assumption, and the second condition was achievable by the researcher!
When this mathematical result is ported over to the applied-statistics context and applied to observational data, the first assumption is more dubious and the second condition is flat-out preposterous! As such, with inapplicable theory behind it, the LASSO isn't any better than older heuristic approaches such as forward selection, which are more straightforward to motivate and teach and which exhibit comparable empirical performance.
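A quick sketch of the comparison on a friendly sparse design (simulated data; this is an illustration of the two procedures side by side, not evidence about their relative merits in general): both the LASSO with a cross-validated penalty and a simple greedy forward selection recover the true support here.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]              # sparse truth: 3 active variables
y = X @ beta + rng.normal(0, 1.0, n)

# LASSO with cross-validated penalty
lasso = LassoCV(cv=5).fit(X, y)
lasso_support = set(np.flatnonzero(np.abs(lasso.coef_) > 1e-6))

def forward_select(X, y, k=3):
    """Greedy forward selection: repeatedly add the variable most
    correlated with the current residual, then refit by least squares."""
    chosen = []
    resid = y.copy()
    for _ in range(k):
        corrs = np.abs(X.T @ resid)
        corrs[chosen] = -np.inf          # don't re-pick chosen variables
        chosen.append(int(np.argmax(corrs)))
        fit = LinearRegression().fit(X[:, chosen], y)
        resid = y - fit.predict(X[:, chosen])
    return set(chosen)

fs_support = forward_select(X, y)
# Both supports contain the true active set {0, 1, 2}
```

On designs like this one, where the signal is strong and the columns are nearly orthogonal, the two procedures are hard to tell apart empirically, which is the point.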
Unreliable asymptotics. "Causal random forests" is a prominent recent example of asymptotic theory that can't be cashed out in practice. This hugely influential paper made a splash for being the first to furnish confidence intervals (i.e. "inference") for treatment-effect estimates arising from nonlinear machine-learning models (specifically, a version of random forests). However, simulations with straightforward data-generating processes reveal that the finite-sample performance of the method is not even remotely close to the nominal asymptotic coverage (sometimes as low as 50% coverage for a nominal 95% interval). To be sure, the theory surrounding causal random forests is interesting in its own right, but it should not be taken in any way as an endorsement to use those intervals in practice. (Incidentally, a famous LASSO-related result also falls into this inapplicable-asymptotics category: Leeb and Pötscher wrote a number of articles pointing out that the so-called "oracle property" of the adaptive LASSO was non-uniform, meaning that the sample size necessary for the result to hold could be arbitrarily large.)
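Checking nominal coverage by simulation is straightforward. Here is a generic sketch (not the causal-forest setting itself, which I have not reproduced; just the textbook normal-approximation interval under skewed errors) of how asymptotically justified coverage can fall short at small sample sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

def coverage_of_normal_ci(sample_size, n_reps=2000, true_mean=0.0):
    """Fraction of replications in which the nominal 95% z-interval
    covers the true mean, under skewed (centered exponential) errors."""
    hits = 0
    for _ in range(n_reps):
        x = true_mean + rng.exponential(1.0, sample_size) - 1.0
        m = x.mean()
        se = x.std(ddof=1) / np.sqrt(sample_size)
        hits += (m - 1.96 * se) <= true_mean <= (m + 1.96 * se)
    return hits / n_reps

small = coverage_of_normal_ci(10)    # noticeably below 0.95
large = coverage_of_normal_ci(500)   # close to 0.95
```

The same exercise, pointed at any interval-producing method, is exactly how the causal-forest intervals were shown to miss their nominal level.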
Unrealistic assumptions about the method. An especially common form of methodological false virtue arises when theoretical development persists on simplified versions of a method simply because the simpler thing admits theoretical study. This mismatch can cut both ways. On the one hand, it means that a theoretical result may not be about the method actually being used. For example, a result might cover the case where the variance is assumed known, when in practice it never is. Or it may be assumed that the model is linear because "nonlinear functions are just linear models with the appropriate interaction terms included". Or maybe the theoretical result applies for a fixed regularization parameter, whereas in practice the regularization parameter is determined empirically via a resampling or data-splitting method such as cross-validation.
On the other hand -- and probably worse -- you may have empirical researchers avoiding more sophisticated methods because they are compelled professionally to stick to methods for which there is theory that can be cited. Until quite recently (a decade or less, I would say), some subfields of applied economics could not get away with analyzing their data with anything but linear models, for lack of theory. My esteemed colleague Rob McCulloch once quipped that restricting yourself to models you have formal theory for is like riding a bicycle instead of a car on the grounds that you don't know how to fix the car yourself. But note the irony: the inapplicable theory from the causal random forests paper had the net positive that it gave empirical researchers a bogus license to use a method that had been subject to an equally bogus prohibition. You might say that an insistence on formal theory incentivizes counterfeits.
Non-constructive theorems. Yet another way that a theoretical result can be impracticable is if it is fundamentally non-constructive. For example, a fascinating result by Wenxin Jiang shows that boosting algorithms are capable of overfitting (contrary to some folklore), but goes on to show that somewhere along the path towards overfitting, the optimal prediction is achieved. I personally find this result fascinating and reading the paper was edifying, but it’s a good example of a result that is fundamentally non-actionable.
The simulation studies of the causal-forest method above were sufficient to show the inadequacy of the theory, because the coverage rates should not depend on the underlying DGP. But conversely, a method that performs exceptionally well on a single simulation study does not thereby earn many bragging rights. Virtually all methodology papers these days contain a simulation study, and it should surprise no one that the new method always seems to shine! A simulation study is only as informative as its data-generating process is representative of practice.
Similarly, the field of machine learning has embraced a relatively small number of easily accessible and, hence, horribly overused benchmark data sets. To take a famous example, the MNIST data set of handwritten digits was introduced by Yann LeCun in 1994 and has been used thousands and thousands of times to evaluate this or that modification of popular supervised learning methods. The current state of the art has an astounding accuracy that is nearly flawless (99.8%) and substantially above human-level performance (96%). As far as I am concerned, these demonstrations mainly prove that a given method is really good at the MNIST data, and perhaps that the MNIST data ought to be retired at this point.
Methodologists are compelled to create new methods, both out of sincere intellectual curiosity and for more mundane reasons like fulfilling career ambitions. The result, sometimes anyway, is the creation of methods in need of an application. Methods researchers, especially young researchers, are often more interested in seeing if something can be done than in considering whether doing so would be useful to anyone (other than their resumes). It is much easier to tweak an existing method -- to minimal practical effect -- than it is to tackle the harder problems, the fruit hanging halfway up the tree, as it were. These incremental advances are not infrequently accompanied by overreaching claims of superiority. Whereas some economists were admonished above for clinging to the tradition of linear models long past the point when promising alternatives like random forests were available, some data scientists reach immediately for the deep neural network when a logistic regression would probably do. Riding a bike instead of a car is stubborn in one way; using a Lamborghini as a family sedan is another kind of misjudgment.
It may seem to a reader that this essay started out with a call for empathy and understanding and midway through had a change of heart and reverted to finger-pointing. But although the tone change was real, from sympathetic to scolding, I believe the logic to be consistent. The point I am trying to make, in both sections, is that methods should be judged by their usefulness for analyzing data. Methodologists who do not believe their work should be judged by its usefulness to data analysis should be forced to announce it in their abstracts and social media posts. In the first part of this essay, I pointed out that it was unfair to criticize a method for not being good at a task for which it wasn't designed or intended. The second part of the essay took that same claim and turned it around: a method should not be embraced and widely deployed on the basis of solving problems other than the ones its ostensible users actually face.
I once spoke with a highly-regarded time series researcher who has done consulting work for the Federal Reserve. He nonchalantly explained to me that the methods he used in his consulting work and the methods he published comprise mutually exclusive sets. He was not lamenting this -- the claim was not that the Fed was missing out on the good stuff. Rather, the published stuff simply wasn’t appropriate for applied work and even novel methods that were invented for practical applications weren’t suitable for publication as a matter of academic fashion. As a young, optimistic new methods researcher, I found this heartbreaking. Was the whole field merely lip-service? Are our theorems just mathematical rosary beads? Is developing a popular method more important than developing a principled one? I don’t claim to have answers to these questions, but the more methodologists who ponder them the better for the integrity of the field.