Post date: Mar 18, 2021 12:58:48 AM
Copied from https://sites.google.com/site/diseaseclusterdetectionp/say/robo-statistician#
In the cluster detection projects, we developed a variety of algorithms for computing statistical power, which is essential for determining which analytical approaches are appropriate and optimal for various medical situations. We had routines to compute about two dozen test statistics; routines to compute their means, variances, and the (exact) p-values associated with the (exact) tests; and many permutation-based routines to automate power calculations for each statistic across various sizes, shapes, and designs of data sets, and across a wide array of clustering models that we could devise in our biologists' heads.
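None of those routines appear in this post, so purely as an illustration of the shape of the permutation machinery described above, here is a minimal sketch. The test statistic `stat(cases, coords)` and the cluster-model simulator `simulate_cluster(coords, rng)` are hypothetical placeholders of my own, not the project's code.

```python
import numpy as np

def permutation_pvalue(stat, cases, coords, n_perm=999, rng=None):
    """Monte Carlo p-value: how often does shuffling the case labels over
    the locations give a statistic at least as extreme as the observed one?"""
    rng = np.random.default_rng(rng)
    observed = stat(cases, coords)
    exceed = sum(stat(rng.permutation(cases), coords) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def estimated_power(stat, simulate_cluster, coords, alpha=0.05,
                    n_sims=200, n_perm=999, rng=None):
    """Power = fraction of data sets simulated under the cluster model
    for which the permutation test rejects at level alpha."""
    rng = np.random.default_rng(rng)
    rejections = sum(
        permutation_pvalue(stat, simulate_cluster(coords, rng), coords,
                           n_perm, rng) <= alpha
        for _ in range(n_sims))
    return rejections / n_sims
```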
After we set all this up on the computer, we were not overwhelmed by the power profiles. The tests just weren't all that powerful. It finally occurred to me that we were limited primarily by the statistics we had defined. And that is a limitation of our imaginations, as I think Neal Oden's on-the-fly creation of an entropy statistic for the exemplar problem kind of demonstrates.
At first I thought maybe we should develop a high-level language for statisticians to define statistics on the fly. Then we could easily apply our power-calculation apparatus to see whether a statistic had any merit. So what we thought about then was how we might design a language to express the statistics we already have. You wouldn't need much: sums, maxes, the second-largest value, arithmetic operations, maybe a few other things. I thought that if we decoded all the operations one needs to specify the two dozen statistics we already have, we'd probably be most of the way toward a reasonably complete language for such statistics.
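The post never defines that language, but as a toy illustration of how little vocabulary might be needed, candidate statistics could be represented as small expression trees over a vector of per-region case counts. The primitives and names below are my own invention, not a proposal from the project.

```python
import numpy as np

# Toy vocabulary: each primitive maps a vector of per-region case counts to a
# number; binary operators combine numbers.
PRIMITIVES = {
    "sum":        np.sum,
    "max":        np.max,
    "mean":       np.mean,
    "var":        np.var,
    "second_max": lambda x: np.sort(x)[-2] if len(x) > 1 else float(np.max(x)),
}

BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 0.0,
}

def evaluate(expr, counts):
    """expr is either a primitive name or a tuple (op, left_expr, right_expr)."""
    if isinstance(expr, str):
        return PRIMITIVES[expr](counts)
    op, left, right = expr
    return BINARY_OPS[op](evaluate(left, counts), evaluate(right, counts))

# Example: "max minus mean", a crude hot-spot statistic in this toy language.
hotspot = ("-", "max", "mean")
print(evaluate(hotspot, np.array([3, 1, 0, 7, 2])))   # 7 - 2.6 = 4.4
```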
We didn't even get through the day with this idea before something even more interesting popped up: why don't we let the computer invent the statistics? If you have a language for defining statistics, it's a simple matter to randomly construct expressions in that language, and possibly to apply a genetic-algorithm approach to designing good statistics. If we include the ones we've already defined plus all the ones the computer creates, and rigorously study them in simulations to find the most powerful set for a specified topology and cluster mechanism, we can't do any worse, right? If we use permutation, we don't need the exact formulas; permutation is so fast on recent computers that there isn't much advantage to the exact formulas anyway.
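The "randomly construct expressions" step really is small once the toy language above exists; a sketch, again purely illustrative:

```python
import random

def random_expr(depth=3, rng=random):
    """Randomly grow an expression tree in the toy language above.
    A genetic algorithm would additionally mutate and recombine the
    highest-power trees instead of drawing every candidate fresh."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(list(PRIMITIVES))      # leaf: a primitive statistic
    op = rng.choice(list(BINARY_OPS))
    return (op, random_expr(depth - 1, rng), random_expr(depth - 1, rng))
```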
It seems to me that all we need to let a computer invent statistics is to apply our battery of power-calculation routines to random creations from a language that generates putative statistics, and to test them against a variety of configuration topologies and cluster mechanisms. I got really excited about this idea of building a computer statistician (think RoboCop), but that's about when our funding ran out. But if you think this is an interesting idea, perhaps we could chat about it sometime. Obviously, the general idea is, well, pretty general, and could be applied to a host of other basic statistical problems.
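To make that concrete, the sketches above could be chained into one loop: generate candidate expressions, estimate each one's power under a chosen topology and cluster mechanism, and keep the best. This is plain random search; a genetic algorithm would add selection, crossover, and mutation of the top trees. All names remain illustrative.

```python
def robo_search(coords, simulate_cluster, n_candidates=500, depth=3, seed=0):
    """Score randomly generated statistics by estimated power; keep the best.
    Reusing the same seed for every candidate means each one is judged on the
    same simulated data sets (common random numbers), for a fairer comparison."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_candidates):
        expr = random_expr(depth, rng)
        # Wrap the expression as a statistic; this toy version ignores the
        # spatial coordinates entirely, which a real statistic would not.
        stat = lambda cases, coords, e=expr: evaluate(e, cases)
        power = estimated_power(stat, simulate_cluster, coords, rng=seed)
        scored.append((power, expr))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:10]   # the ten most powerful candidates found
```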
Once you have a language in which statistics can be expressed, and a way to check whether an expression in that language is a statistic that's good for anything, i.e., one with good power in some configuration, then you can let the computer crunch and design the best tests to use in that configuration. The more I think about it, the less I understand why people aren't doing this already. What do you think?
I'm excited about robo-statistician. What could go wrong?