Data Critique
Our data breaks down cancer mortality across the U.S. population by state, age, race, sex, and cancer type, each of which serves as a variable. It can illuminate trends among these variables and show whether each is associated with cancer mortality. The variables can also be analyzed together to assess whether they produce a compound effect on cancer mortality. Our data cannot serve as a primary tool for understanding cancer mortality trends over time, however, because it compiles information collected over a span of years without including the year as a variable. As a result, it paints an overall picture of associations but will not show the dips and spikes that can indicate important trends. This data is therefore best used alongside trend-related data to assess the overall context behind cancer mortality in the U.S. population from 2007 to 2013.
Turning to issues with the data, we noticed a number of zero entries. Some are clearly missing data. For example, Idaho’s data reports no cancer deaths among Black females (Hispanic and Non-Hispanic), but 145.3 deaths per 100,000 population for Non-Hispanic Black males and 165.3 deaths per 100,000 population for Hispanic Black males. This shows that data was collected from a Black population in Idaho, yet Black females appear not to have been recorded, which suggests that the Black female columns are missing data. In other cases, the situation is less clear: zeroes appear across all of the columns representing an identity, or only in some of them. We cannot assume that these are missing data, because a demographic that makes up a small proportion of a state’s population could produce smaller tallies, and it is also harder to collect data for such demographics. Nevertheless, these zeroes raise questions about systematic issues in data collection that should be further investigated.
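The Idaho check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `flag_suspicious_zeros`, and the "Examplestate" rows are hypothetical and not part of our dataset; only the two Idaho male rates come from the text.

```python
# Hypothetical layout: (state, race group, sex) -> deaths per 100,000.
# Only the two Idaho male rates are taken from the text; the rest is invented.
rates = {
    ("Idaho", "Black Non-Hispanic", "Male"): 145.3,
    ("Idaho", "Black Hispanic", "Male"): 165.3,
    ("Idaho", "Black Non-Hispanic", "Female"): 0.0,
    ("Idaho", "Black Hispanic", "Female"): 0.0,
    ("Examplestate", "Black Non-Hispanic", "Male"): 0.0,
    ("Examplestate", "Black Non-Hispanic", "Female"): 0.0,
}


def flag_suspicious_zeros(rates):
    """Return zero entries whose state and race group has a nonzero
    rate for the other sex, i.e. zeros that look like missing data."""
    flagged = []
    for (state, group, sex), value in rates.items():
        if value != 0:
            continue
        # Rates for the same state and race group, but a different sex.
        siblings = [
            v for (s, g, x), v in rates.items()
            if s == state and g == group and x != sex
        ]
        if any(v > 0 for v in siblings):
            flagged.append((state, group, sex))
    return sorted(flagged)


for key in flag_suspicious_zeros(rates):
    print(key)
```

The two Idaho female entries are flagged, while the hypothetical state with zeroes for both sexes is not, which mirrors the distinction between zeroes that are clearly missing data and zeroes we cannot make assumptions about.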
We decided not to make assumptions or fill in these zero values with averages (mean or median) because doing so could skew the data. We do not have comparable populations from which to fill in missing data: we are looking at trends across different states, and such patterns can be lost if we replace a state’s missing data with data from other states. Likewise, because we attempt to apply intersectionality to our data, we treat combinations of identities as separate categorical variables in order to consider disparities between these combination groups. We therefore cannot fill in a zero for one combination group using values from a group that shares only a single identity (in the Idaho example above, we cannot assign a value for Black females based on the values for Black males). We also decided not to remove any columns, since different columns account for different combination groups and are essential for intersectional analysis. These choices may introduce some imprecision in our statistical findings and visualizations relative to the true real-world values, but our findings, especially the trends and disparities, were ultimately consistent with those of the research articles we cited.
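To illustrate how the treatment of zero entries can affect a summary statistic, here is a small Python sketch. The numbers are invented, the function names are hypothetical, and the sketch does not describe how our own figures were computed; it only shows why unfilled zeroes can reduce precision.

```python
from statistics import mean


def mean_keeping_zeros(values):
    """Average every entry, counting zeroes as real observations."""
    return mean(values)


def mean_skipping_zeros(values):
    """Average only nonzero entries, treating zeroes as missing.
    Nothing is imputed; returns None if every entry is zero."""
    observed = [v for v in values if v != 0]
    return mean(observed) if observed else None


# Invented rates for one demographic group in four hypothetical states.
example = [150.0, 0.0, 170.0, 0.0]

print(mean_keeping_zeros(example))   # 80.0
print(mean_skipping_zeros(example))  # 160.0
```

Neither option fills in a value borrowed from another state or group. The gap between the two results shows how much a summary can shift depending on whether a zero is read as a true rate or as missing data.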
As for how the data is organized, we noticed a potential issue: it counts Hispanic as a race, which creates the potential for double counting individuals. In data collection, Hispanic is not typically defined as a race; instead, respondents answer a yes-or-no question about whether they identify as Hispanic/Latino. The data acknowledges this by separating the group White into Hispanic and Non-Hispanic subgroups. It makes the same separation for Black, except for the total, which could make it harder for us to analyze disparities between Black people of Hispanic and Non-Hispanic heritage. Nevertheless, for data analysis purposes, it’s important for us to consider Hispanic as a demographic in its own right rather than following levels of identification such as race and ethnicity. This way, we can see disparities between Hispanic death rates and those of other demographics that Hispanics are a part of, such as White and Black, and the separation of those groups into Hispanic and Non-Hispanic subgroups could reveal the role multiple identities play in health. We also note that the data could be made more precise by breaking each racial group down into its various ethnicities: although we see disparities between races, there are also disparities between ethnicities within each race that the data cannot show.
Additionally, the data does not cover much variety in cancer types; only three types are included. It would be interesting to know why these types were chosen and why others were left out. We acknowledge that this narrows the scope of our analysis of whether there are disparities between groups for each cancer, as there are only three examples to look at. We also noticed that the data does not allow analysis of disparities between sexes for breast cancer, which could be because females have significantly higher rates of breast cancer than males. This would still have been useful to visualize and to examine through the lens of intersectionality, as we do have the race data for breast cancer. Finally, and probably the most important limitation, although we try to analyze the relationship between socioeconomic status, identities, and cancer death rates, the data contains no information on income level, which could have been a critical measure of socioeconomic status. We instead relied on outside research to fill this gap.
Ideological Effects of Data Organization
In regards to the ideological effects of how sources have been divided into data, compiling this cancer mortality database involves choosing which sources or databases to use, a decision that may introduce bias. For example, multiple organizations produce cancer mortality statistics, so a choice has to be made about which reporting institution’s data to use for the dataset. The division of sources and variables may also introduce confirmation bias. As mentioned before, breast cancer overwhelmingly affects women but can affect men in rare cases. The complete exclusion of male instances of breast cancer from the dataset is either a deliberate decision or an error in data collection. Lastly, the data is limited to the information reported by the agencies involved. Because the funding allocated to cancer mortality research varies widely across the country, there are disparities in the race, age, sex, and cancer type data that is reported, which can affect the findings drawn from this dataset. By over- or underrepresenting certain populations or variables, the data may fail to adequately represent the diverse nature of cancer across the country.