Statshow is an interactive checklist and documentation tool for statistical analyses. The idea is that, several months after you did any analysis (or, worse, a whole bunch of analyses), it’s often very difficult to reconstruct exactly what you did. This tool will create a file that collects all the essential details and inputs together in one place and in a standard format. It will also be able to automatically create HTML and RTF files that document the analysis for use in reports or perhaps as appendices to reports. The software might be useful for anyone who has to deal with compliance with those ISO9000 record-keeping rules or with the brand new “peer review standards” guidance just issued by the OMB under John Graham.
Maybe this has already been done somewhere else. Have you seen something like it? I wasn’t aware of any such tools, but, the more I think about it, the more I think we need to have such a program. Here’s a description of what it will do. (It’s currently just a mockup and doesn’t save any input or have the planned data files that implement the cross-referencing described below.)
There are several screen captures of the mockup of the software at the end of this document. On those screens, the leftmost column documents the method and the software you used to conduct the analysis. You can pick a standard statistical procedure from a structured list, and the program automatically knows references for it from Sokal and Rohlf, Zar, etc. (We’ll just survey several statistics textbooks and accumulate page references for the tests.) Likewise, you can pick the software package you used to do the calculations and the program will automatically know the formal reference for the software too. Of course you can also just type in the name of a method or software package not on the list and enter the references by hand. There’s also space for you to enter the run stream, parameters, and procedures you used for the analysis. This would be where you’d transcribe any notes you might have made during the analysis that should enable you to recreate the analysis again.
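To make the lookup idea concrete, here is a minimal sketch of how the method-to-reference table might work. The method names and reference strings below are illustrative placeholders, not actual survey results; the real entries would come from the textbook survey described above.

```python
# Illustrative lookup table: standard method name -> textbook references.
# The entries are placeholders to be filled in by the planned survey.
METHOD_REFERENCES = {
    "two-sample t test": [
        "Sokal and Rohlf, Biometry (page reference from survey)",
        "Zar, Biostatistical Analysis (page reference from survey)",
    ],
    "Mann-Whitney U test": [
        "Sokal and Rohlf, Biometry (page reference from survey)",
    ],
}

def references_for(method):
    """Known references for a standard method; an empty list signals
    that the user must type in the references by hand."""
    return METHOD_REFERENCES.get(method, [])
```

The same table-driven pattern would serve for the software-package references.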
The middle column documents the assumptions you made in the analysis. When you pick a standard method from the structured list, the program will automatically rack up the assumptions that go with that method. If you enter a method it doesn’t know, you’d have to enter the assumptions yourself. Double-clicking on one of the assumptions invokes a little dialog for that assumption, where you can see the definition of the assumption and find a space to write a justification for making the assumption about the data at hand. (In the future, we might consider letting the program automatically test the assumption against the actual data whenever this is possible. Resit predicts that people will hate this program if we give it that feature. Depending on the magnitude of the hate it invokes, it might constitute a truly valuable contribution to science.)
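A sketch of how the assumption records might be racked up, assuming a simple per-method list (the method and assumption texts are illustrative, and the real definitions would come from the data files described below):

```python
from dataclasses import dataclass, replace

@dataclass
class Assumption:
    name: str
    definition: str           # fixed text for standard assumptions
    justification: str = ""   # written by the user for the data at hand

# Illustrative entries only; real lists would be stored in data files.
STANDARD_ASSUMPTIONS = {
    "two-sample t test": [
        Assumption("independence", "Observations are mutually independent."),
        Assumption("normality", "Each sample comes from a normal distribution."),
        Assumption("equal variances", "The two populations share a common variance."),
    ],
}

def assumptions_for(method):
    """Auto-populate assumptions for a known method; an unknown method
    yields an empty list and the user supplies assumptions by hand."""
    # Copies, so editing a justification never alters the master list.
    return [replace(a) for a in STANDARD_ASSUMPTIONS.get(method, [])]
```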
The justification may involve one or more other statistical analyses. When this is the case, the user can link to another Statshow file or files, so these files can be cross-linked or nested in a natural way.
When a user picks one of the standard assumptions, the full definition of that assumption comes from a data file. The definition is not editable by the user, and is only provided as a helpful prompt for filling in the justification field. However, if the assumption is newly specified by the user, then the user should enter both the definition and whatever justification it requires.
I’ve also provided space in this column for the user to say how nondetects, outliers and multiple tests were handled. In some cases, there is space for a linked Statshow file. Do you think I need such a space for the censoring treatment too? Maybe for Helsel’s more complicated calculations? I’m not really sure how “control variables” should be documented. The fields about them are just a placeholder for now. Any ideas?
A user can add any additional comments at the bottom of the column in a free-form Rich Text Format field. As with other fields in Statshow, this field can be arbitrarily long, so it can accommodate whatever detail the user thinks is necessary or appropriate.
In the rightmost column, you specify the data that was used in the analysis. You can specify the number of dimensions and/or groups. Some tests/methods imply certain constraints on the number of dimensions or groups. For each dimension/group you can give the names of the variable and the group, the statistical ensemble (population) it represents, along with the nature of the data, the units they are given in, and any transformations that you applied to the values.
You can paste the actual data into the values field (it can be arbitrarily long...it’ll scroll), or specify the source of the data as a uniform resource identifier (including both the locator and the name), or you can both paste the actual data used and say where they came from. If you specify the data values, you can ask the program to make a statistical synopsis and a graphical thumbnail of the values you pasted in. It shows enough detail that you should be able to be sure you have all the data (and only the data) you wanted. This is a specification of the data by value, but you can also specify the data by reference by using the source field instead of the values field. If you do that, you might specify who gave you the data, the search key you used on the database, or any other details essential to recapture the same values in the future. For the sake of comprehensiveness, one would naturally prefer to use the values field rather than relying on some merely referential characterization of the source. But people are strange; many have told me they’d prefer to give only the source. Are they trying to hide their tracks?
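The statistical synopsis of pasted values might be computed along these lines (a sketch, using the sample standard deviation; the statistics listed here match the terse-report synopsis described later):

```python
import math

def synopsis(values):
    """Sample size, minimum, maximum, arithmetic mean, and sample
    standard deviation of the pasted-in values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    return {"n": n, "min": min(values), "max": max(values),
            "mean": mean, "sd": math.sqrt(var)}
```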
The capture button is used to invoke your favorite browser, spreadsheet or data manager. Within the invoked program you can navigate to the data set on your disk, on your network or, for that matter, anywhere on the internet. On returning, the data themselves would be copied back into the values field, or the location of the data would be copied into the source field, or both fields would be filled. The data might not be copied if doing so would violate copyright law or proprietary restrictions. In principle, the capture button could also read and interpret the information in the source field to go collect the actual data and reproduce it in the values field. This would be very useful for updating an analysis when the underlying data are subject to change, whether by editing or by the addition of new values. User confirmation would be requested if doing this would require any data to be overwritten.
When possible, the dimension, number of groups and respective sample sizes will be interpreted from the captured data set. This feature will clearly require artificially intelligent programming, but some conventions and published file structures for input files for popular statistical programs (even perhaps including Excel) can be exploited to gather the needed details. In any case, if the actual data are present in the values field, the spinedit control for the sample size cannot be edited by the user. If the sample sizes vary from group to group, the check box for paired/associated data is unchecked. This could precipitate a further warning if an unbalanced design invalidates the named statistical test. If only source information is provided, the sample sizes remain editable when the source specifies the data only with a starting point.
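The group-size interpretation and the paired/associated checkbox rule might be sketched as follows, assuming the captured data arrive as one list of values per group:

```python
def interpret_groups(groups):
    """Derive group count and sample sizes from captured data, and
    decide whether the paired/associated-data box can stay checked.
    The box is unchecked automatically when sizes differ by group."""
    sizes = [len(g) for g in groups]
    paired_possible = len(set(sizes)) <= 1  # all groups the same size
    return {"groups": len(groups), "sizes": sizes, "paired": paired_possible}
```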
You can also give details about the team that gathered and manages the data. Pressing the little button opens a dialog. Entering names in the fields of this dialog, or picking names from a list, and pressing the OK button makes an entry for the team field. Of course the lists of people and companies can be customized, and the lists can be different for each field. We can also make default answers for some or all of the fields.
Finally, there is a space for ancillary notes about the data.
A user can request a report to be written via the Files menu option. Output as either HTML or RTF (Rich Text Format) is supported. The number and formats of these reports can be configured by the user. By default, the program will provide three output modes: terse, verbose and comprehensive.
The terse output includes the statistical method and software (but not their references), the run stream, and the list of assumptions (but not their definitions or justifications). It gives single-line characterizations of the treatments for nondetects, outliers, and multiple tests (or no line if the treatment was inapplicable for the data set), along with any comments. It also reports the dimensions and numbers of groups, and a summary of the data in each dimension/group, including the variable and group name, the nature and units of measurement, the transformation used if any, a statistical synopsis of the data (sample size, minimum, maximum, arithmetic mean, standard deviation), the first and last dozen values of the data set (or all the data if there are fewer than 40 values), any source information, and any notes that were made.
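The first-and-last-dozen rule for the terse report might be sketched like this (the ellipsis marker standing in for the omitted middle values is an assumption about presentation):

```python
def terse_values(values):
    """All values if there are fewer than 40; otherwise the first and
    last dozen, with a marker for the omitted middle of the data set."""
    if len(values) < 40:
        return list(values)
    return list(values[:12]) + ["..."] + list(values[-12:])
```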
The verbose output will include all the information of the terse output, plus the assumption justifications, at least one line for each special treatment, all the data if they can be displayed on three or fewer pages (or, if they cannot, as much as can be displayed on two pages), the ensemble, all the synoptic statistics (i.e., sample size, minimum, maximum, arithmetic mean, standard deviation, variance, population variance, skewness, kurtosis, and sum), and the team information.
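The full synoptic statistics could be computed as below. This is only a sketch: the memo does not say which skewness and kurtosis estimators to use, so the uncorrected moment-based versions here are an assumption.

```python
import math

def full_synopsis(values):
    """Synoptic statistics for the verbose report: sample size, min,
    max, mean, sample and population variance, standard deviation,
    skewness, kurtosis, and sum."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    return {
        "n": n, "min": min(values), "max": max(values), "mean": mean,
        "sd": math.sqrt(var),
        "variance": var,
        "population variance": m2,
        "skewness": m3 / m2 ** 1.5,   # uncorrected moment estimator
        "kurtosis": m4 / m2 ** 2,     # uncorrected; 3 for a normal
        "sum": sum(values),
    }
```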
The comprehensive output will contain all information held by Statshow in the current file (including each and every datum and both the definition and justification of each assumption), and terse summaries of any nested or linked Statshow files referenced in the current file.
Users will also be able to fashion customized reports by editing a tokenized report format. Different formats can be designed by the user and given appropriate names. Any format can be selected on the fly at print time. We have already implemented the requisite report generation features in other programs developed at Applied Biomathematics.
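The tokenized report format might work along these lines. The %NAME% token syntax is an assumption; the memo does not commit to a particular token convention, and the real generator would follow the one already implemented in our other programs.

```python
import re

def render_report(template, fields):
    """Substitute tokens like %METHOD% with field values. Unknown
    tokens are left in place so a custom format can be debugged by
    inspecting the output."""
    def sub(match):
        return str(fields.get(match.group(1), match.group(0)))
    return re.sub(r"%([A-Z_]+)%", sub, template)
```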
The data set issues that have not yet been addressed in our thinking or the program mockup include the treatment of missing values, and provisions for specifying controls of various kinds. Any thoughts on either of these issues would be heartily appreciated.
If this development effort is undertaken, the program will probably be restructured to allow some modularity in the interface and displays. Displays should telescope in a natural way so that the initial display is not overly complex, scary or overwhelming to beginning or marginally technical users. We are undecided about which design elements will be used to implement this in the user interface. We might use paging, tabbing, a menu structure, dimming, or other conventions. Naturalness, ease of accessibility, and flatness of structure will be the goals of the interface options we employ.
We will certainly incorporate time stamps and record file dates and checksums into the data structures we use so that Statshow can alert you about inconsistencies, such as when the data files have changed since you referenced them.
Of course, we understand that this whole thing is supremely goofy, because it pretends you can regularize something as complex and nuanced as statistical analyses. But there is something to be said for structuring reportage. God knows that it’d be smart to keep track of stuff in a way that makes the information retrievable. If we can make the interface rich enough to be reasonably comprehensive for a healthy majority of the routine analyses that are done on a big project, and yet simple enough not to scare the pants off the people who’ll be saddled with the QA and documentation tasks, the result might actually be something that would be, in the end, a helpful convention. It’s hard for me to believe that it could be much worse than the utter lack of regularity that we now suffer. What do you think?
We welcome all criticisms and suggestions. Send emails to Scott Ferson (ferson@liverpool.ac.uk).