Recursive partitioning for epidemiological research

Relevance: Epidemiological Research

In many studies, it is of interest to identify population subgroups that are relatively homogeneous with respect to an outcome. The nature of these subgroups can provide insight into effect mechanisms and suggest targets for tailored interventions. However, identifying relevant subgroups can be challenging with standard statistical methods.

I reviewed the literature on decision trees, a family of techniques that partition a population, on the basis of covariates, into distinct subgroups that share similar values of an outcome variable. I compared two decision tree methods, the popular Classification and Regression Tree (CART) technique and the newer Conditional Inference Tree (CTree) technique, assessing their performance in a simulation study and on data from the Box Lunch Study, a randomized controlled trial of a portion size intervention.

Summary: Both methods identified homogeneous population subgroups and offered improved prediction accuracy relative to regression-based approaches when subgroups were truly present in the data. An important distinction, however, is that CTree builds its trees within a formal statistical hypothesis testing framework, which simplifies the process of identifying and interpreting the final tree model.
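The distinction between the two splitting strategies can be illustrated with a minimal Python sketch. It is not the code used in the review and not the API of any CART or CTree package: the function names, the simulated covariates (x_signal, x_noise), and the simple permutation test used as a stand-in for CTree's conditional inference test are all assumptions made for illustration.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (a regression-tree impurity)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_cart_split(x, y):
    """CART-style step: greedily pick the threshold on x that minimises
    the total within-group SSE. It always returns some split."""
    best = None
    for t in sorted(set(x))[:-1]:  # drop the max so both sides are non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best  # (threshold, total SSE after the split)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """CTree-style gate (simplified stand-in): permutation test of the
    association between covariate x and outcome y."""
    rng = random.Random(seed)
    xbar, ybar = mean(x), mean(y)
    xc = [xi - xbar for xi in x]

    def stat(ys):
        # Absolute centred cross-product, a simple linear association statistic.
        return abs(sum(a * (b - ybar) for a, b in zip(xc, ys)))

    observed = stat(y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical simulated data: the outcome shifts when x_signal > 0.5,
# while x_noise is unrelated to the outcome.
rng = random.Random(42)
n = 200
x_signal = [rng.random() for _ in range(n)]
x_noise = [rng.random() for _ in range(n)]
y = [(2.0 if xs > 0.5 else 0.0) + rng.gauss(0, 1) for xs in x_signal]

cart_signal = best_cart_split(x_signal, y)
cart_noise = best_cart_split(x_noise, y)
p_signal = permutation_p_value(x_signal, y)
p_noise = permutation_p_value(x_noise, y)

print("CART split on x_signal:", cart_signal)
print("CART split on x_noise: ", cart_noise)
print("Permutation p-value, x_signal:", p_signal)
print("Permutation p-value, x_noise: ", p_noise)
```

In this sketch the greedy search proposes a split for every covariate, including the unrelated one, so a CART-style tree generally needs a separate pruning step to decide which splits to keep. The test-based gate instead yields a p-value for each covariate that can be compared with a significance threshold before any split is made, which is the sense in which a hypothesis testing framework simplifies the choice of the final tree.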

Relevant Papers:

visTree: Visualisation of subgroups determined by a decision tree

Summary: visTree is an exploratory visualisation tool developed to characterise the subgroups determined by a decision tree structure.

Motivation: The standard display of a decision tree structure does not necessarily allow researchers to characterise the identified subgroups. The problem is exacerbated by predictor variables that may not have an interpretable scale built on established norms.

The R package visTree addresses this limitation and provides a novel visualisation to characterise the subgroups generated by a decision tree. Each terminal node identified by a decision tree corresponds to a subplot in the visualisation. For a complete description of the functionality, a vignette (with examples) is available at cran.r-project.org/web/packages/visTree/vignettes/visTree.html

This work was developed in collaboration with Julian Wolfson, University of Minnesota, Twin Cities.