While the world awaits a cure for breast cancer, experts continue to develop tools and practices to prevent, diagnose, and treat the disease. Currently, one of the most important preventive measures is regular, early screening, and the most effective screening method is the mammogram. Although mammograms are key to early detection, not everyone who should get one is able to. This raises a key question: which factors influence a patient's likelihood of receiving a screening?
We approach this question through the following steps:
A multivariable regression model, generated with forward selection
A decision tree and a random forest, both utilizing the covariates from the forward selected model
A bag decision tree and a bag random forest
Between 1989 and 2017, breast cancer mortality decreased by an astounding 40%. This decrease is largely attributed to early diagnostic measures, such as breast cancer screenings.
https://www.cancer.org/latest-news/facts-and-figures-2020.html
The multivariable model was generated through a forward selection process. Each covariate was checked against the following assumptions: linearity, independence, normality, and homoscedasticity.
Residual standard error: 0.96 on 2079 degrees of freedom
Multiple R-Squared: 0.8872; Adjusted R-Squared: 0.886
P-Value: < 2.2e-16
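For readers unfamiliar with forward selection, here is a minimal Python sketch of the general idea. It is not the procedure used to produce the model above. It assumes a greedy search that adds, at each round, the covariate giving the largest drop in residual sum of squares, and stops once the gain in R-squared falls below a threshold. The stopping rule (`min_r2_gain`), the function names, and the toy data are all hypothetical, since the selection criterion actually used is not stated here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no perfectly collinear covariates).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_sum_of_squares(columns, y):
    """Fit ordinary least squares (with intercept) and return the RSS."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design))
           for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_select(data, y, min_r2_gain=0.01):
    """Greedily add covariates until the R-squared gain is below min_r2_gain.

    The threshold rule is an illustrative assumption, not the original criterion.
    """
    total_ss = residual_sum_of_squares([], y)  # intercept-only model
    current_rss = total_ss
    selected, remaining = [], list(data)
    while remaining:
        scores = [
            (residual_sum_of_squares([data[v] for v in selected + [name]], y), name)
            for name in remaining
        ]
        best_rss, best_name = min(scores)
        if current_rss - best_rss < min_r2_gain * total_ss:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        current_rss = best_rss
    return selected


# Hypothetical toy data: the outcome depends on two covariates, not on "noise".
toy = {
    "pap_test": [60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0],
    "obesity": [30.0, 22.0, 28.0, 18.0, 25.0, 15.0, 27.0, 20.0],
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
mammography = [10.0 + 0.5 * p + 0.3 * o
               for p, o in zip(toy["pap_test"], toy["obesity"])]
selected = forward_select(toy, mammography)
print(selected)
```

In practice, statistical packages usually drive this search with an information criterion such as AIC or with p-value thresholds rather than a fixed R-squared gain. The threshold is used here only to keep the sketch short.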
Above is the decision tree generated from the covariates selected by both the forward selection and the bag model. The variable that appears most often in this tree is the pap test, which is typically used to screen for cervical cancer. Intuitively, this makes sense: the ability and desire to receive a pap test are likely to carry over to receiving a mammogram.
Some other significant variables are: diabetes prevalence, obesity prevalence, annual unemployment rate, and high school graduation.
To assess which model is the most predictive, we compared the mean squared error (MSE) between the predicted and actual values for the test set.
Linear Model MSE: 1.915
Decision Tree MSE: 2.365
Pruned Regression Tree (not pictured) MSE: 2.417
Random Forest (not pictured) MSE: 0.965
Bag Random Forest (not pictured) MSE:
The random forest has the smallest MSE, but it is also the least interpretable model. We can, however, use the random forest to assess variable importance.
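As a small illustration of the comparison described above, the sketch below computes the MSE of two sets of test-set predictions and picks the model with the lower value. The numbers and the model names (`model_a`, `model_b`) are hypothetical and are not taken from this analysis.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out mammography rates and two models' predictions.
actual = [70.0, 72.5, 68.0, 75.0]
predictions = {
    "model_a": [71.0, 72.0, 69.0, 74.0],
    "model_b": [73.0, 70.5, 66.0, 77.0],
}

mse_by_model = {
    name: mean_squared_error(actual, preds) for name, preds in predictions.items()
}
best_model = min(mse_by_model, key=mse_by_model.get)
print(mse_by_model, best_model)
```

Because every model is scored on the same held-out test set, the MSE values are directly comparable, and a lower value indicates predictions that sit closer to the observed rates.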
These are the most 'important' variables (based on predictive power) according to the random forest model. The pap test and obesity metric have the largest values, so we will visually assess their relationship to mammography use. However, our U.S. analysis demonstrated the importance of diversity in understanding breast cancer mortality. So we will also include diversity in our visual assessment below, where we have categorized mammography use as high (above the NYC median) or low (below the NYC median).
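The high/low split described above can be sketched as follows. This is a minimal illustration with hypothetical area names and rates. It also assumes that an area sitting exactly at the median is labeled "low", a tie-breaking choice the text does not specify.

```python
from statistics import median


def categorize_by_median(rates):
    """Label each area 'high' if its rate is above the median, else 'low'.

    Assumption: a rate exactly equal to the median is treated as 'low'.
    """
    cutoff = median(rates.values())
    return {area: ("high" if rate > cutoff else "low")
            for area, rate in rates.items()}


# Hypothetical mammography rates (percent) for four areas.
rates = {"area_a": 70.1, "area_b": 75.3, "area_c": 72.0, "area_d": 78.4}
categories = categorize_by_median(rates)
print(categories)
```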
This graph depicts the prevalence of pap tests against diversity, along with the mammography category each point falls into. High mammography rates appear to correlate strongly with high pap test prevalence. However, mammography use appears to be scattered relative to diversity.
This graph depicts the general prevalence values of obesity rate and diversity. Low mammography rates appear to be associated with lower obesity rates and high diversity rates. This graph also has a significant overlap between obesity and diversity.
This graph depicts the prevalence of pap tests against obesity rate, with the corresponding category of mammography use. Again, high pap test prevalence is associated with high mammography rates, and low obesity rates are associated with low mammography rates.
Breast cancer screenings are a crucial step in diagnosing and treating the disease. However, certain demographics are far more likely to receive these screenings. While many of the listed variables (and undoubtedly many unlisted ones) show some association with receiving breast cancer screenings, some variables are far more significant than others.
For example, receiving pap smears is highly correlated with receiving mammograms. This is somewhat intuitive, as this metric reflects both the ability and desire to receive healthcare. Another important variable is obesity rate. The graph above shows how places with low obesity rates seem to have low mammography use, relative to the distribution of mammography use across diversity.
These trends raise an important question: how do we enable people who lack the resources for these screenings to receive them? Not only does mammography use vary within individual subsets, but these subsets are highly correlated with each other, creating numerous overlapping subsets of individuals. To continue to increase the rates of mammography use, it is important to identify the subsets of people who need screenings most. The variables above show predictive power for determining the likelihood of people receiving mammograms and can serve as a first stepping stone in identifying the groups of people who are least likely to receive mammograms and are thus most likely to have untreated breast cancer.