Theme and goals

Advances in storage and networking, and the scale-up in deployment of intelligent tutors, have led to an explosion of data available for analysis.  This expansion of raw material for analyses has taken place in terms of both columns and rows of data.  Columns of data refer to the number of variables available, which has expanded as more data are logged by tutors, and as more external sources of data are applied.  For example, if a researcher builds a series of student models for various cognitive (e.g. knowledge tracing) or behavioral (e.g. off-task detection) properties, that expands the number of columns available for analyses.  Another common method for greatly expanding the number of columns is to join two databases together; for example, one could relate the words a student used in response to a question to their semantic properties.  Increases in the number of rows result from recording datasets involving more students, or from recording data at a finer grain size: for example, recording every student response rather than just summary data for the student.
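The column-widening join described above can be sketched as follows. This is a minimal illustration, not any particular tutor's schema: the log rows, the word-property table, and all field names (`student`, `word`, `syllables`, `frequency`) are invented for the example.

```python
# Hypothetical sketch: widening a tutor log by joining an external
# word-property table.  All names and values here are invented.

tutor_log = [
    {"student": "s1", "word": "velocity"},
    {"student": "s2", "word": "force"},
]

# External lexical database keyed by word (hypothetical properties).
word_properties = {
    "velocity": {"syllables": 4, "frequency": "low"},
    "force":    {"syllables": 1, "frequency": "high"},
}

def join_rows(log, props):
    """Left-join each log row with the matching word properties,
    adding the external columns to every row."""
    joined = []
    for row in log:
        merged = dict(row)
        merged.update(props.get(row["word"], {}))
        joined.append(merged)
    return joined

wide_log = join_rows(tutor_log, word_properties)
# Each row now carries extra columns (syllables, frequency),
# every one of which becomes available for new analyses.
```

The point of the sketch is that a single join multiplies the variables available per row, without any new data collection.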

In general, both of these trends are useful, and have greatly extended the scope and quality of analyses performed on data collected by ITSs.  An increase in the number of columns means that researchers can now test many more hypotheses than they could previously.  In addition, an increase in the number of rows yields greater statistical power, enabling greater sensitivity to detect small effects that may exist.  Although these advances have brought great benefits, they also carry a definite cost: an increase in the number of analyses that are reportable, but are of marginal utility and may even be false.
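The power gain from added rows can be made concrete with a rough back-of-the-envelope calculation. Under a normal approximation, the two-sided 5% critical value for a Pearson correlation is roughly 1.96/sqrt(rows), so the smallest detectable effect shrinks as the row count grows; the specific row counts below are arbitrary illustrations.

```python
# Rough sketch: under a normal approximation, a correlation is
# "significant" at the two-sided 5% level when |r| exceeds about
# 1.96 / sqrt(rows).  More rows therefore mean smaller detectable
# effects.  Row counts below are arbitrary.

import math

for rows in (100, 1000, 10000, 100000):
    r_crit = 1.96 / math.sqrt(rows)
    print(f"{rows:>7} rows -> smallest detectable |r| ~ {r_crit:.3f}")
```

A true effect of r = 0.1 is invisible at 100 rows but comfortably detectable at 10,000, which is exactly the sensitivity gain (and the risk) discussed here.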

The reason for concern is simple arithmetic.  First, as the number of columns grows, the number of testable relationships increases as columns², since each new variable can be tested against all of the existing variables in the database.  Second, the ability to detect statistically “significant” effects increases with the number of rows, growing as sqrt(rows).  These two effects compound, and result in a vast increase in the number of significant relationships one can discover from the collected data.  The problem arises when one considers the number of useful relationships in the data.  Many, many variables will correlate with each other purely by chance, or because they merely share a common cause.  Discovering all of these chance associations is not exciting from a research standpoint, but by community standards such results would be publishable, and it is not always immediately obvious from statistical hypothesis testing which results are of interest and which are not.  Simply put, we do not want to be in a community where researchers report every effect they discover that has a small p-value.
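The arithmetic above can be simulated directly. The sketch below generates columns of pure noise, tests every pair of columns, and counts how many correlations cross an approximate 5% significance threshold; by construction, every one of those "findings" is spurious. The column and row counts, the seed, and the normal-approximation threshold are all illustrative assumptions.

```python
# Minimal simulation of the multiple-comparisons arithmetic: with k
# purely random columns there are k*(k-1)/2 testable pairs, and at
# alpha = 0.05 roughly 5% of them will look "significant" by chance
# alone.  All numbers here are illustrative.

import math
import random

random.seed(0)

n_rows, n_cols = 200, 30
data = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Approximate two-sided 5% critical value for r (normal approximation);
# note it shrinks as sqrt(rows) grows, exactly as described above.
r_crit = 1.96 / math.sqrt(n_rows)

pairs = [(i, j) for i in range(n_cols) for j in range(i + 1, n_cols)]
spurious = sum(1 for i, j in pairs
               if abs(pearson(data[i], data[j])) > r_crit)

print(f"{len(pairs)} pairs tested, {spurious} 'significant' by chance")
```

With 30 columns there are already 435 testable pairs, so around twenty chance "discoveries" are expected even though the data contain no real relationships at all; doubling the columns roughly quadruples the pair count.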