(This post was uploaded on 8th February 2019. It was revised on 14th February 2019.)
An article by Deborah Mayo in Statistical Science (Mayo 2014) reignited a debate about the validity of Birnbaum's argument for the likelihood principle (Birnbaum 1962), which is often referred to as Birnbaum's theorem. One would have hoped that by now everyone would have realised that the proof of this theorem is horrendously and unsalvageably flawed. However, people are still willing to defend this theorem, e.g. Gandenberger (2015) and Peña and Berger (2017). For this reason, I express in the appendix my own opinion about its logical inadequacies. Nevertheless, from a sociological perspective, an intriguing question is how such a red herring was allowed to live such a long life.
Therefore, let us go back to 1961. By this point in time, Birnbaum had somehow managed to achieve the status of being a gilt-edged member of the statistical establishment, on the basis of what could be argued to be fairly meagre contributions to the foundations of statistics (although mainly published in top statistical journals). This status meant that it would have been fairly easy for him to get his famous (and very long-winded) 1962 paper read before the American Statistical Association (ASA) with George Box in the Chair on 27th December 1961 (surely most attendees would have been better off enjoying the Christmas period with their families).
At this moment, I should point out that I bear no personal dislike of Birnbaum. I am aware that he was considered to be a lovely human being (which is a quality that goes above anything else), and we can all feel a great empathy with regard to the circumstances surrounding his untimely and tragic death.
The result in his famous paper, though, was convenient for a lot of people, particularly the influential and highly respected Leonard J. Savage, who gave the following majestic introduction to his evaluation of the paper at the meeting in question:
"Without any intent to speak with exaggeration or rhetorically, it seems to me that this is really a historic occasion. This paper is a landmark in statistics ... It would be hard to point to even a handful of comparable events."
It is likely that most people who read the 1962 paper that resulted from this meeting looked at the premises concerned and the supposed consequences, and then skimmed through the rather short corresponding proofs. Could anyone believe that a person like Allan Birnbaum could be so badly wrong? Apparently not, and this would explain why the early objections to the theorem really only nipped around its ankles (Durbin 1970 and Kalbfleisch 1975).
It was not until 1990 that Joshi dared to go closer to the heart of the matter and by doing so, claimed that the proof of the theorem contained a fallacy (Joshi 1990). Nevertheless, two years later, this did not stop Birnbaum's paper taking pride of place in the prestigious book "Breakthroughs in Statistics" (eds. Kotz and Johnson, 1992, Vol. 1). Obviously Joshi was not a member of the right private club.
Closer to the present day, Mayo began developing in Mayo (2010) an argument against the theorem by expanding on the line of attack used in Joshi (1990), which led to her aforementioned 2014 article, while during the same period Evans presented a notable critique of what the proof actually succeeds in establishing (Evans 2013).
Of course, the likelihood principle rightly remains an important point of discussion in statistics, but that is not the issue at stake here. The question is whether Birnbaum's theorem adds to its importance in any way, and my answer would be a very clear no.
It is natural to think about whether Birnbaum realised his blunder before his death in 1976. I would view this as being unlikely as he seemed to have been a fairly responsible man, and surely he would have been interested in helping humanity by contacting the ASA to withdraw the paper in question before he very sadly took his own life.
The key concept in Birnbaum's theorem is that of an evidential equivalence (which we will call an EvE) of the type Ev(E,x) = Ev(E*,y). This concept is very delicate, but it is nevertheless an adequate basis for argumentation. However, the first difficulty we find in Birnbaum's theorem is the lack of a clear guide about the rules that these equivalences are supposed to follow. Obviously, we would expect these equivalences to always retain a tangible meaning; after all, we are doing statistics here (which is about the real world) and not abstract mathematical nonsense(!).
We will classify an EvE as being legal when it has a tangible meaning and illegal when it does not. This immediately gives us the following rule.
Golden rule: if an EvE has the same data x or the same experiment E on both its left and right-hand sides, then the same EvE cannot be reapplied with different data y or a different experiment E*, as that would imply that the data or the experiment are two different things at the same time (which is impossible).
Let us now clarify some notation:
SP = Sufficiency principle (as in Birnbaum 1962)
CP = Conditionality principle (as in Birnbaum 1962)
SLP = Strong likelihood principle (Likelihood principle in Birnbaum 1962)
WLP = Weak likelihood principle (Lemma 1 on page 278 of Birnbaum 1962)
It is clear that each of these principles is expressed using a legal EvE, which is what you would expect (since, as mentioned above, everyone checked the principles but not the proofs!).
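For readers who prefer symbols, here are standard modern paraphrases of these principles in terms of EvEs (my own summary, not Birnbaum's exact 1962 wording):

```latex
% Standard paraphrases of the principles (assumed wording, not Birnbaum's own):
\textbf{SP:}\;\; t(x) = t(x') \text{ for a sufficient statistic } t \text{ of } E
  \;\Longrightarrow\; \mathrm{Ev}(E, x) = \mathrm{Ev}(E, x').
\textbf{CP:}\;\; \text{if } E_{\mathrm{mix}} \text{ selects component experiment } E_h
  \text{ with known probability and } E_h \text{ then yields } x_h, \text{ we have }
  \mathrm{Ev}\big(E_{\mathrm{mix}}, (h, x_h)\big) = \mathrm{Ev}(E_h, x_h).
\textbf{SLP:}\;\; f_E(x \mid \theta) = c\, f_{E^*}(y \mid \theta) \text{ for all } \theta
  \text{ and some } c > 0
  \;\Longrightarrow\; \mathrm{Ev}(E, x) = \mathrm{Ev}(E^*, y).
% WLP: the same statement as the SLP, but restricted to the case E = E^*.
```

Note that the SP and the WLP, as written here, keep the same experiment on both sides of the EvE, whereas the CP and the SLP do not, which is the kind of distinction the golden rule above is concerned with.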
Here is a list of some of the results that are claimed in Birnbaum (1962):
1) SP implies WLP. Wrong! An illegal EvE is needed in the proof.
2) CP implies SP. Wrong! It is not possible to attack a proof when one is not really offered. However, there has been a general consensus for a long time that Birnbaum's intuition here was wrong.
3) Main result: SP and CP imply SLP. Wrong! Depends on result (1) being true.
4) Adjusted main result: WLP and CP imply SLP. Wrong! To go from equation 5.2 (of Birnbaum 1962) to equation 5.3 requires the use of an illegal EvE (the usual rendering of the argument is sketched after this list).
Here is a pair of rather inconsequential results that can be trivially proved without using an illegal EvE.
5) SLP implies WLP. Right! (Thank goodness)
6) SLP implies SP and CP. Right! (Such a shame that this is only Birnbaum's result backwards)
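For readers who have not seen the argument being criticised, here is the usual textbook paraphrase of the mixture argument behind results (3) and (4) (my own summary, not a quotation of equations 5.2 and 5.3):

```latex
% The usual textbook paraphrase of the mixture argument (assumed rendering).
% Suppose f_E(x \mid \theta) = c\, f_{E^*}(y \mid \theta) for all \theta and some c > 0,
% and let E_B be the mixture experiment that performs E or E^* according to a fair coin.
\text{SP applied to } E_B:\quad
  \mathrm{Ev}\big(E_B, (1, x)\big) = \mathrm{Ev}\big(E_B, (2, y)\big),
\text{CP applied to each component:}\quad
  \mathrm{Ev}\big(E_B, (1, x)\big) = \mathrm{Ev}(E, x), \qquad
  \mathrm{Ev}\big(E_B, (2, y)\big) = \mathrm{Ev}(E^*, y),
\text{and chaining these EvEs gives}\quad
  \mathrm{Ev}(E, x) = \mathrm{Ev}(E^*, y), \text{ i.e. the SLP.}
```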
Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57, 269–306.
Durbin, J. (1970). On Birnbaum’s Theorem on the Relation Between Sufficiency, Conditionality and Likelihood. Journal of the American Statistical Association, 65, 395–398.
Evans, M. (2013). What does the proof of Birnbaum’s theorem prove? Electronic Journal of Statistics, 7, 2645–2655.
Gandenberger, G. (2015). A new proof of the likelihood principle. British Journal for the Philosophy of Science, 66, 475–503.
Joshi, V. M. (1990). Fallacy in the proof of Birnbaum’s Theorem. F27 in discussion forum of Journal of Statistical Planning and Inference, 26, 111–112.
Kalbfleisch, J. D. (1975). Sufficiency and Conditionality. Biometrika, 62, 251–259.
Mayo, D. G. (2010). An error in the argument from conditionality and sufficiency to the likelihood principle. In Error and Inference (eds. D. G. Mayo and A. Spanos), 305–314, Cambridge University Press, Cambridge.
Mayo, D. G. (2014). On the Birnbaum argument for the strong likelihood principle (with discussion). Statistical Science, 29, 227–266.
Peña and Berger (2017). A note on recent criticisms to Birnbaum's theorem. arXiv:1711.08093.
(This post was uploaded on 20th August 2022.)
In looking for a method of inference, or family of methods, that would resolve the Foundations of Statistics Crisis, we may ask what principles we would expect this method or family to obey. Here I offer my list of seven such principles.
Principle 1: The use of the method or family of methods must be justified by a sound, defensible and consistent mathematical and philosophical logic.
Principle 2: The method or family must be able to make inferences that are relevant to the problem at hand and the data that have actually been observed, i.e. inferences in some sense need to be conditioned on the data.
Principle 3: The method or family must make efficient use of the information that is contained in the data so that the most precise inferences that are possible can be made.
Principle 4: The method or family must be able to flexibly and adequately take into account all pre-data or contextual information that is relevant to the inferential problem, including in particular the absence of certain elements of this information.
Principle 5: Given that probability is the most universal and convenient measure of uncertainty, the method or family must be capable of placing post-data probabilities on the hypotheses that are relevant to the scientific problem of interest.
Principle 6: The method or family must in some way take into account how likely it would have been to generate the actual observed data under the various models / parameter values of interest, as opposed to the probability of observing data similar to, or more extreme than, the actual data (a small numerical illustration is given after this list).
Principle 7: The method or family must, in some way, address the fact that, in practice, model assumptions are almost always wrong.
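To make Principle 6 concrete, here is a minimal sketch (my own hypothetical example, not part of the list above) that contrasts, for a small binomial data set, the probability of the count actually observed with the probability of counts as or more extreme:

```python
# Contrast the probability of exactly the observed count with the usual
# tail probability of counts as or more extreme (hypothetical example).
from scipy.stats import binom

n, x_obs = 20, 15                              # 15 successes in 20 trials
for p in (0.5, 0.75):                          # two parameter values of interest
    prob_exact = binom.pmf(x_obs, n, p)        # probability of the actual data
    prob_tail = binom.sf(x_obs - 1, n, p)      # P(X >= x_obs), the usual tail area
    print(f"p = {p}: P(X = {x_obs}) = {prob_exact:.4f}, P(X >= {x_obs}) = {prob_tail:.4f}")
```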
In my opinion:
Frequentist theory does not consistently obey any of these principles except, perhaps, Principle 3. Objective Bayesianism also breaks most of the principles. Subjective Bayesianism does better, but fails to obey Principles 4 and 7, and to some degree Principle 1.
(This post was uploaded on 27th August 2022.)
Designing a hypothesis test under standard theory, i.e. Neyman-Pearson (N-P) theory, requires fixing a significance level, which we denote as alpha. This "fixing alpha thing" is a key that opens at least three doors, each of which reveals flaws that underlie this testing theory.
To begin, we note that controlling error rates in decision making is the core idea on which N-P theory is based, and in particular two types of error are identified, i.e. the famous Type 1 and Type 2 errors. The error probabilities alpha and beta are defined as follows: alpha = P(Type 1 error) = P(reject H0 | H0 is true) and beta = P(Type 2 error) = P(accept H0 | H1 is true), where H0 and H1 are the null and alternative hypotheses, respectively. In controlling error rates there usually needs to be a trade-off between alpha and beta, since decreasing alpha will generally result in beta increasing. The best trade-off is determined through a loss function.
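As a minimal sketch of this trade-off (my own hypothetical example), take one observation X from N(mu, 1), test H0: mu = 0 against H1: mu = 1, and reject H0 when X exceeds a cut-off c; lowering alpha by raising c necessarily raises beta:

```python
# The alpha/beta trade-off for H0: mu = 0 vs H1: mu = 1 with X ~ N(mu, 1),
# rejecting H0 when X > c (hypothetical illustration).
from scipy.stats import norm

for c in (1.0, 1.645, 2.0, 2.326):
    alpha = norm.sf(c, loc=0.0)     # P(X > c | mu = 0) = P(reject H0 | H0 true)
    beta = norm.cdf(c, loc=1.0)     # P(X <= c | mu = 1) = P(accept H0 | H1 true)
    print(f"c = {c:.3f}: alpha = {alpha:.3f}, beta = {beta:.3f}")
```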
Personally, I cannot see how such a loss function could be chosen without considering the prior probabilities of H0 and H1 being true, since alpha and beta are very difficult to interpret; however, let us put that to one side. We will simply assume that a trade-off between alpha and beta needs to be made in some way. This, though, should immediately set off alarm bells, as a large part of N-P theory relies on fixing alpha in isolation. We will come back to this topic.
However, for now we will simply open Door 1 to the problem of "that fixing alpha thing" and we find the obvious: if we fix alpha at some level, then no matter how small the P value turns out to be, all that matters is that it is less than alpha. This is what the theory says. Neyman and Pearson may have tried to put a gloss on this, but we should only be interested in the actual theory. Doors 2 and 3 will be opened in the following parts of this post, and on opening these doors it will be argued that N-P testing theory is not really a theory but just an analysis of Fisher's significance testing theory.
An introduction was given in Part 1 to the issue of fixing the significance level alpha in isolation in standard (Neyman-Pearson) testing theory, i.e. without reference to a trade-off between the two error rates alpha and beta. Let me now give more details.
First question: When are we guilty of fixing alpha in isolation? In the Neyman-Pearson lemma? No, in this lemma we consider two point hypotheses and the lemma identifies the smallest beta that corresponds to any given alpha. This therefore allows an alpha/beta trade-off.
What about the concept of a uniformly most powerful test? Yes, here we have a problem. This is because we first fix alpha and then consider various different alternative point hypotheses, each of which corresponds to a different beta. We clearly cannot trade off alpha and beta if the relationship between alpha and beta varies depending on what the exact alternative hypothesis is. In fact, this issue generally raises its ugly head whenever the null or alternative hypotheses are composite hypotheses, i.e. when they cover a range of values for the parameter of interest.
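A small sketch of the difficulty (again my own hypothetical example): fix alpha = 0.05 for the one-sided test of H0: mu = 0 against the composite alternative mu > 0, based on one observation from N(mu, 1), and note that beta is a completely different number depending on which point in the alternative we look at:

```python
# With alpha fixed at 0.05, beta depends entirely on which point alternative
# within the composite hypothesis mu > 0 is being considered.
from scipy.stats import norm

c = norm.isf(0.05)                  # cut-off giving alpha = 0.05 under mu = 0
for mu1 in (0.2, 0.5, 1.0, 2.0, 3.0):
    beta = norm.cdf(c, loc=mu1)     # P(X <= c | mu = mu1)
    print(f"mu1 = {mu1}: beta = {beta:.3f}")
```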
Therefore, if alpha cannot be traded off against beta, how are we going to choose alpha? Well, back in the 1930s there was this guy called Fisher who had an alternative theory of hypothesis testing (with its own separate philosophical justification), and he often liked setting alpha at fixed values such as 0.05 or 0.01. And so hey, why not ignore the core principle that underlies our "theory" and steal this idea of fixing alpha in isolation from Fisher? And it appears that's exactly what Neyman and Pearson did, meaning that a large part of their "theory" effectively became just an analysis of Fisher's testing theory.
In the next part of this post the final door will be opened to the problem of "that fixing alpha thing" and we will look at what (if anything) can be rescued from Neyman-Pearson testing theory.
Let us continue with the theme of how in the development of standard hypothesis testing theory, its founders Neyman and Pearson strayed away from their core idea of trading off the error rates alpha and beta and instead (without a clear justification) chose to fix alpha (the significance level) in isolation.
In particular, let's open Door 3 to the problem of "that fixing alpha thing" by considering the case where we have two point hypotheses and the data are not at all consistent with either hypothesis. Here, if we fix alpha, then we may reject the first hypothesis and accept the second one, or reject the second hypothesis and accept the first one, simply depending on which of the two hypotheses is labelled as being the null hypothesis(!)
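A numerical sketch of this (my own hypothetical numbers): take one observation from N(mu, 1), let the two point hypotheses be mu = -5 and mu = +5, and suppose we observe x = 0, which is five standard deviations away from both. With alpha fixed at 0.05, the most powerful test accepts whichever hypothesis happens not to be labelled as the null:

```python
# Two point hypotheses mu = -5 and mu = +5, one observation X ~ N(mu, 1),
# observed x = 0, which is wildly inconsistent with both. Fix alpha = 0.05.
from scipy.stats import norm

x = 0.0
z = norm.isf(0.05)                  # 1.645

reject_A = x > -5 + z               # labelling A: H0: mu = -5 vs H1: mu = +5
reject_B = x < 5 - z                # labelling B: H0: mu = +5 vs H1: mu = -5

print("Labelling A accepts mu = +5" if reject_A else "Labelling A accepts mu = -5")
print("Labelling B accepts mu = -5" if reject_B else "Labelling B accepts mu = +5")
# Both tests reject their null, so the accepted hypothesis flips with the
# labelling, even though x = 0 has density of order 1e-6 under either hypothesis.
```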
Also, justifying a certain class of confidence intervals as "compatibility intervals" by inverting a hypothesis test would clearly be more sensibly done on the basis of Fisher's type of hypothesis test rather than Neyman-Pearson's, since fixing alpha in isolation has an adequate meaning when conducting the former type of test.
We can, though, make a useful connection between Neyman-Pearson testing theory and standard (Neyman) confidence interval theory. In particular, in the case where there are just two point hypotheses of interest, if we always choose to set alpha equal to beta, then the overall error rate will always be equal to alpha (= beta), independent of what prior probabilities are given to the two hypotheses concerned.
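The one-line calculation behind this claim, writing pi for the prior probability of H0, is simply:

```latex
% Overall (prior-weighted) error rate for two point hypotheses:
P(\text{error}) = \pi\, P(\text{reject } H_0 \mid H_0) + (1 - \pi)\, P(\text{accept } H_0 \mid H_1)
                = \pi \alpha + (1 - \pi) \beta = \alpha \quad \text{whenever } \alpha = \beta,
% which holds for every value of \pi in [0, 1].
```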
This is a similar property to the property that confidence intervals possess of covering the true value of the parameter with a given (pre-data) probability. However, the strategy in question has the same major drawback as confidence interval theory, i.e. post-data uncertainty may feel very different from pre-data probabilities!
In the fourth and final part of this post, some general conclusions about Neyman-Pearson testing theory will be drawn.
This post has looked at standard (Neyman-Pearson) testing theory from the point of view provided by the requirement to fix the value of alpha (the significance level of the test). Let us now draw some general conclusions about this testing theory based on this analysis.
First, we may argue that Neyman-Pearson (N-P) theory is just an attempt to "make a Bayesian omelette without breaking the Bayesian eggs". In particular, if we were allowed to place prior probabilities on or over the null hypothesis (H0) and the alternative hypothesis (H1), then we could satisfactorily resolve the issue of how the error rates alpha and beta are traded off when these two hypotheses are point hypotheses, and we could also extend the core N-P principle of trading off alpha and beta to the case where H0 and/or H1 are composite hypotheses.
Without the ability to perform an alpha/beta trade-off in this latter case, N-P theory resorts to setting the value of alpha in isolation which does not have a justification within the logic of this theory, something which is easy to overlook given that this practice does have an adequate meaning in Fisher's testing theory. However if this practice is simply "lifted" from Fisher's theory, then N-P theory effectively just becomes an analysis of Fisher's theory (an analysis that for me adds little to what could be established if we just stayed within the logic of Fisher's ideas).
Finally, can we make N-P testing theory truly independent of the choice of a prior distribution? Yes, we can, by setting alpha = beta, meaning that the overall error rate is equal to alpha (= beta). Doing this brings the logic of N-P testing theory into line with the logic of confidence interval theory, but this idea has not caught on. This is perhaps because it is difficult to make the error rate alpha look as though it is a post-data rather than a pre-data probability!
(This post was uploaded on 3rd September 2022.)
Edgeworth studied languages and literature at university, and it appears that his knowledge of mathematics was acquired almost entirely through self-study, which is remarkable given the deep mathematical insights into statistical theory that he was able to present. In this respect, he expanded on and corrected the work of his predecessors, including Laplace, and most famously developed a way of approximating probability distributions through the use of Edgeworth series.
However, the work of Edgeworth has been overlooked, perhaps for three reasons:
1) He gave us no new big and complete statistical method;
2) He presented his ideas in ways that were a bit obscure;
3) He was (deliberately!) given little credit by Karl Pearson.
The importance of his work in statistics comes from adding up his many smaller results and from the fact that he gave us key ideas that would later be more fully developed (or simply presented again!) by Pearson and R. A. Fisher. For example, Edgeworth put forward Pearson's correlation estimate (r) before Pearson did(!); he also invented Pearson's chi-squared test for the case where there are only two categories; and in 1892 he presented ideas relating to multiple regression and multivariate normal densities five months before Pearson first publicly discussed similar ideas.
Furthermore, he outlined in 1885 how a two-way analysis of variance can be done, although without the F distribution that would be contributed later by Fisher, and in 1908, he gave a proof of the asymptotic efficiency of maximum likelihood estimators which was long before Fisher showed the existence of this property.
Edgeworth's work was frequently praised by Francis Galton, but given that Edgeworth's main interest was in analysing economic rather than biological data, Galton ended up collaborating much more with Pearson. By being applied in the field of economics, Edgeworth's very technical statistical ideas were kind of "in the wrong place at the wrong time".
His outlook on inference is often described as Bayesian, but Edgeworth was pragmatic and showed that he was willing to take a frequentist viewpoint where necessary.
(This post was uploaded on 10th September 2022.)
Debates about foundational issues in statistics are dominated by the conflict between Bayesians and frequentists. So which group is right? It would seem natural for an outsider to conclude that they both must be wrong. This intuition can in fact be backed up by argumentation. Moreover, it would appear that the debate between the two groups is so well balanced because both sides are wrong to an equal degree.
The Bayesian paradigm is simple and tidy, but it falls down badly when we need to use the prior distribution to express a lack of knowledge about the model parameters, which unfortunately is nearly all the time. The frequentist paradigm is supported by a lot of good intuitive sentiment about how inference should be done, but unfortunately frequentists are easily led astray because they attempt to achieve 'objectivity' by using frequentist probability which is in fact fairly useless for weighing up uncertainty about a fixed but unknown parameter.
Bayesians perhaps would get closer to doing the right thing if they accepted that their paradigm has very definite limitations, while frequentists perhaps could do better if they did not treat long-run frequentist properties as the 'queen bee' that must be protected at all costs. However, trying to make incremental improvements may not be the best idea.
Also, while a reconciliation of frequentism and Bayesianism may keep some supporters of both camps happy, what is good for social harmony may represent very bad science. I feel we (humans) need to think harder so that we can go beyond this dichotomy of statistical ideologies.
(This post was uploaded on 17th September 2022.)
In his first major paper on statistical inference, published in 1774, Laplace derived and applied a more general version of Bayes' theorem than had earlier been proposed by Bayes himself. At the time, Laplace and his peers thought this work was completely rather than only partially original, as they did not become aware of Bayes' work until around 1780.
In a later work, Laplace would go on to present Bayes' theorem in the general form as we recognise it today. Nevertheless, he generally advocated the use of flat/uniform prior densities without any of the (quite legitimate!) philosophical concerns that Bayes had. This emphasis on doing statistics using mathematics without philosophy was typical of Laplace but quite out of keeping with his era.
However, Laplace entered the 'statistics orchard' when there were many juicy mathematical apples to be picked, and, given his enormous prowess in maths, he took full advantage. (Advice to mathematicians: such apples are not so easy to find these days!) As a result, he gave us the first general central limit theorem (in particular, he showed how the binomial distribution can in general be approximated by a normal distribution), and he invented characteristic functions (a very useful idea!).
He was also the first to estimate a normal regression model using the method of maximum likelihood, although from a Bayesian perspective. Nevertheless, his general ease in switching from a Bayesian to a non-Bayesian viewpoint meant that he influenced the work of future generations of both Bayesians and frequentists. For example, he followed a distinctly frequentist path in successfully justifying that least squares estimators of regression coefficients are, in effect, best linear unbiased estimators.
The relevance of Laplace's statistical ideas was kept on track by his keen application of them to important data sets of his day.
(This post was uploaded on 8th October 2022.)
The Behrens-Fisher problem is the problem of making inferences about the difference between the means of two normal populations based on independent samples from each population. Of course, such a fundamental problem is discussed as a matter of routine in all introductory textbooks on statistics. Well not quite!
In fact, this problem is rarely discussed in such books, except when:
1) The variances of the two populations are known, which is rarely the case in practice;
2) The sample sizes are big enough so that we can use a z test, which effectively means using an approximate solution to the wrong problem;
3) The population variances are assumed to be equal, which implies that we can use a Student t test, but leads to the false impression that this case is important in practice and that interventions generally affect means rather than variances.
The way that the genuine problem is deliberately overlooked gives a prime indication that things are not well, and never have been well, in the foundations of statistics. In short, this is because there is no solution to the Behrens-Fisher problem using standard frequentist (i.e. Neyman-Pearson) theory if the two population variances are unknown (and possibly not equal). Approximate confidence interval solutions to the problem have been suggested, e.g. Welch's solution, but they are unsatisfactory both for being approximate and for being based on confidence intervals (and therefore for relating to inferences that are not, in general, relevant to the case at hand)!
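For concreteness, Welch's approximate solution is what most statistical software returns when equal variances are not assumed; here is a minimal sketch (hypothetical data) using SciPy:

```python
# Welch's approximate two-sample t test, i.e. the usual software output when
# equal population variances are not assumed (hypothetical data).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=10.0, scale=1.0, size=12)   # smaller variance
sample2 = rng.normal(loc=11.0, scale=4.0, size=15)   # much larger variance

t_stat, p_value = ttest_ind(sample1, sample2, equal_var=False)   # Welch's test
print(f"Welch t = {t_stat:.3f}, two-sided P value = {p_value:.3f}")
```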
In my opinion, this problem has a simple solution. I am referring here to the solution proposed by Ronald Fisher himself, which uses the Behrens-Fisher distribution. Of course to be able to regard this solution as being acceptable you really need to be able to embrace fiducial inference, which Fisher did and I do. Any other takers?
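For anyone curious about what Fisher's solution amounts to in practice, here is a minimal Monte Carlo sketch (my own code, hypothetical data) of the fiducial distribution of the difference in means, i.e. the Behrens-Fisher distribution: under the fiducial argument, each mean mu_i is distributed as the sample mean minus its standard error times an independent Student t variable with n_i - 1 degrees of freedom.

```python
# Monte Carlo sketch of Fisher's fiducial (Behrens-Fisher) interval for
# delta = mu1 - mu2: draw delta as
#   (xbar1 - xbar2) - (s1/sqrt(n1))*T1 + (s2/sqrt(n2))*T2,
# where T1 and T2 are independent Student t variables on n1-1 and n2-1 df.
import numpy as np
from scipy.stats import t

def fiducial_interval(x, y, level=0.95, n_draws=100_000, seed=1):
    x, y = np.asarray(x), np.asarray(y)
    n1, n2 = len(x), len(y)
    se1 = x.std(ddof=1) / np.sqrt(n1)
    se2 = y.std(ddof=1) / np.sqrt(n2)
    rng = np.random.default_rng(seed)
    draws = (x.mean() - y.mean()
             - se1 * t.rvs(n1 - 1, size=n_draws, random_state=rng)
             + se2 * t.rvs(n2 - 1, size=n_draws, random_state=rng))
    return np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])

# Hypothetical data with clearly unequal variances
rng = np.random.default_rng(2)
x = rng.normal(10.0, 1.0, size=8)
y = rng.normal(12.0, 5.0, size=10)
print("95% fiducial interval for mu1 - mu2:", fiducial_interval(x, y))
```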
(This post was uploaded on 15th October 2022.)
The motivation for much of Galton's work in statistical theory was the theory of natural selection that had been developed and published in 1859 in the famous book "Origin of Species" by his cousin Charles Darwin. In particular, Darwin's theory relied on fresh variation being introduced into living populations with each new generation that is born, but how does the amount of variation not explode out of control as we move through generations?
Galton's first major attempt at answering this question in 1877 was a failure, but by using his famous bean machine (a device that physically demonstrates the importance of a normal distribution) he did show (by analogy) for the first time (among other things) that a normal mixture of normal densities is itself a normal density.
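In modern notation, the fact that Galton demonstrated by analogy with his machine can be written as:

```latex
% A normal mixture of normal densities is again normal:
X \mid \mu \sim N(\mu, \sigma^2), \quad \mu \sim N(m, \tau^2)
\;\;\Longrightarrow\;\; X \sim N(m, \sigma^2 + \tau^2).
```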
When Galton returned in 1885 to try to fill the same hole in Darwin's theory, many new statistical ideas were brought out into the open. This time his solution was that an explosion of variability in the population was avoided by "regression to the mean". The interplay between normal linear regression models, the bivariate normal distribution and the correlation between variables that is commonly taught to undergraduate students was essentially all discovered by Galton.
By being the first to study in depth the properties of the bivariate normal distribution, he essentially founded multivariate statistical analysis. The ideas of marginal and conditional distributions had not really been discussed before Galton developed them, and the concept of correlation had not been clarified.
Like any good applied statistician, he had a hunger for data and in this respect he pioneered the use of questionnaires in genetic and psychological research. He was highly influential on later statisticians and was greatly respected and praised by many of them, e.g. Karl Pearson and Ronald Fisher.
Postscript
Some sage advice to Francis Edgeworth from the older Francis Galton in a personal letter dated 28th October 1881: "It is always the case with the best work, that it is misrepresented, and disparaged at first, for it takes a curiously long time for new ideas to become current, and the older men who ought to be capable of taking them in freely, will not do so through prejudice."