Introduction

Background

Title: Socioeconomic and Demographic Impacts on Average Cancer Diagnosis and Deaths per Year in the US

Author: Alexa Neal

In this project, I wanted to explore how various non-biological factors are related to the average number of cancer cases diagnosed and the average number of cancer deaths in the US. One study found that individuals on Medicare with low incomes had greater challenges in affording their healthcare (Park & Fung, 2025), highlighting that factors other than genetics and biology can impact health outcomes. This also showcases how certain populations are vulnerable to worse outcomes due to lack of access to healthcare. Identifying these factors can be useful in understanding barriers to important medical resources and how to improve access.

Research Questions

To explore these concepts, I found two data sets on Kaggle, one containing health-related information and the other containing demographic information, and combined them. I utilized exploratory data analysis and multiple linear regression models to understand what factors correlate with average number of cancer cases diagnosed and the average number of cancer deaths per year. My research questions were:

  1. How do the cancer target death rate and average number of cancer deaths vary geographically across U.S. states?
  2. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer diagnoses per year across U.S. counties?
  3. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer deaths per year across U.S. counties?

Variables of Interest

  • avganncount: Average number of cancer cases diagnosed annually

  • avgdeathsperyear: Average number of deaths due to cancer per year

  • target_deathrate: Target death rate due to cancer

  • medincome: Median income in the region

  • povertypercent: Percentage of population below the poverty line

  • pctprivatecoveragealone: Percentage of population covered by private health insurance alone

  • pctempprivcoverage: Percentage of population covered by employer-provided private health insurance

  • pctpubliccoveragealone: Percentage of population covered by public health insurance only

  • pctwhite: Percentage of White population

  • pctblack: Percentage of Black population

  • pctasian: Percentage of Asian population

Other Variables

  • incidencerate: Incidence rate of cancer

  • popest2015: Estimated population in 2015

  • studypercap: Per capita number of cancer-related clinical trials conducted

  • binnedinc: Binned median income

  • medianage: Median age in the region

  • pctpubliccoverage: Percentage of population covered by public health insurance

  • pctotherrace: Percentage of population belonging to other races

  • pctmarriedhouseholds: Percentage of married households

  • birthrate: Birth rate in the region

  • statefips: The FIPS code representing the state

  • countyfips: The FIPS code representing the county or census area within the state

  • avghouseholdsize: The average household size in the region

  • geography: The geographical location, typically represented as the county or census area name followed by the state name

Data Cleaning

Before performing any analysis, I used plot_intro() to get an overview of the data set and found that there were missing values. I then used plot_missing() to see which columns contained them. To clean the data, I removed observations with missing values in pctprivatecoveragealone and pctemployed16_over, since I wanted to keep both variables in the analysis. I also removed the pctsomecol18_24 column entirely due to its large number of missing values.

Introduction to Data

Missing Value Distribution

Maps

Target Death Rate

Average Deaths per Year

Analysis

To explore whether cancer outcomes differ across regions of the United States, I computed state-level averages of the target death rate and the actual number of cancer deaths and visualized them using choropleth maps. These descriptive maps help reveal broad geographic patterns in cancer outcomes before moving to the regression analysis.

The maps reveal variation in the target death rate and the average number of cancer deaths by geographic location. The target death rate is the desired goal for deaths in an area, expressed as an age-adjusted benchmark rate per 100,000 people. The lowest target death rates are in Western states, and the highest are in Appalachia and parts of the South (West Virginia, Kentucky, Tennessee, Mississippi, and others).

The second map shows the average number of cancer deaths per year, which is a raw count rather than a rate and therefore reflects population size. States with larger populations (such as California, Florida, and New York) naturally have more cancer deaths, whereas states with smaller populations (such as Alaska and North Dakota) have fewer. Since the target death rate is a standardized value and the number of deaths is a raw count, the two variables should not be directly compared. However, the geographic differences in both maps indicate that there may be socioeconomic and demographic factors that explain why some states have larger cancer burdens than others, motivating the regression analysis later on.
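
To see why a raw count and a rate per 100,000 behave differently, a small worked example may help. The population and death figures below are hypothetical and chosen only for illustration, and the formula is the crude (not age-adjusted) rate, so it simplifies how a standardized rate would actually be computed.

\[ \text{crude rate per 100,000} = \frac{\text{deaths per year}}{\text{population}} \times 100{,}000 \]

\[ \text{Area A: } \frac{2{,}000}{1{,}000{,}000} \times 100{,}000 = 200 \qquad \text{Area B: } \frac{25}{10{,}000} \times 100{,}000 = 250 \]

Area A has 80 times as many deaths as Area B, yet Area B has the higher rate, which is why a map of counts and a map of rates can rank states very differently.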

EDA

Diagnosis

Deaths

Insurance

Average Diagnoses

Average Deaths

Income

Average Diagnoses

Average Deaths

Race

Average Diagnoses

Average Deaths

Correlation

Average Diagnoses

| | log_avganncount | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoveragealone | medincome | povertypercent | pctwhite | pctblack | pctasian |
|---|---|---|---|---|---|---|---|---|---|
| log_avganncount | 1.00 | 0.33 | 0.38 | -0.15 | 0.35 | -0.22 | -0.08 | 0.04 | 0.38 |
| pctprivatecoveragealone | 0.33 | 1.00 | 0.93 | -0.86 | 0.79 | -0.76 | 0.31 | -0.27 | 0.28 |
| pctempprivcoverage | 0.38 | 0.93 | 1.00 | -0.73 | 0.75 | -0.68 | 0.27 | -0.24 | 0.29 |
| pctpubliccoveragealone | -0.15 | -0.86 | -0.73 | 1.00 | -0.72 | 0.80 | -0.37 | 0.33 | -0.18 |
| medincome | 0.35 | 0.79 | 0.75 | -0.72 | 1.00 | -0.79 | 0.16 | -0.26 | 0.41 |
| povertypercent | -0.22 | -0.76 | -0.68 | 0.80 | -0.79 | 1.00 | -0.51 | 0.52 | -0.15 |
| pctwhite | -0.08 | 0.31 | 0.27 | -0.37 | 0.16 | -0.51 | 1.00 | -0.83 | -0.27 |
| pctblack | 0.04 | -0.27 | -0.24 | 0.33 | -0.26 | 0.52 | -0.83 | 1.00 | 0.02 |
| pctasian | 0.38 | 0.28 | 0.29 | -0.18 | 0.41 | -0.15 | -0.27 | 0.02 | 1.00 |

Average Deaths

| | log_avgdeathsperyear | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoveragealone | medincome | povertypercent | pctwhite | pctblack | pctasian |
|---|---|---|---|---|---|---|---|---|---|
| log_avgdeathsperyear | 1.00 | 0.21 | 0.31 | -0.01 | 0.28 | -0.08 | -0.18 | 0.14 | 0.42 |
| pctprivatecoveragealone | 0.21 | 1.00 | 0.93 | -0.86 | 0.79 | -0.76 | 0.31 | -0.27 | 0.28 |
| pctempprivcoverage | 0.31 | 0.93 | 1.00 | -0.73 | 0.75 | -0.68 | 0.27 | -0.24 | 0.29 |
| pctpubliccoveragealone | -0.01 | -0.86 | -0.73 | 1.00 | -0.72 | 0.80 | -0.37 | 0.33 | -0.18 |
| medincome | 0.28 | 0.79 | 0.75 | -0.72 | 1.00 | -0.79 | 0.16 | -0.26 | 0.41 |
| povertypercent | -0.08 | -0.76 | -0.68 | 0.80 | -0.79 | 1.00 | -0.51 | 0.52 | -0.15 |
| pctwhite | -0.18 | 0.31 | 0.27 | -0.37 | 0.16 | -0.51 | 1.00 | -0.83 | -0.27 |
| pctblack | 0.14 | -0.27 | -0.24 | 0.33 | -0.26 | 0.52 | -0.83 | 1.00 | 0.02 |
| pctasian | 0.42 | 0.28 | 0.29 | -0.18 | 0.41 | -0.15 | -0.27 | 0.02 | 1.00 |

Analysis

The distributions of average annual cancer diagnoses and average annual cancer deaths are both right-skewed, which suggests that a log transformation will be useful. For the rest of this analysis, I used the log of average annual cancer diagnoses and the log of average annual cancer deaths.
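
As a rough illustration of why a log transformation helps with right-skewed counts, the sketch below uses simulated data rather than the cancer data set. The variable names, the lognormal parameters, and the hand-written `skewness()` helper are all assumptions made for this example.

```r
# Simulated right-skewed "counts" (hypothetical data, not the cancer data set)
set.seed(42)
raw_counts <- rlnorm(1000, meanlog = 5, sdlog = 1.2)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(raw_counts)
skew_log <- skewness(log(raw_counts))

cat("Skewness before log:", round(skew_raw, 2), "\n")
cat("Skewness after log: ", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation indicates a roughly symmetric distribution, which is closer to what linear regression works well with.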

Insurance is weakly to moderately correlated with both the log of cancer diagnoses and the log of cancer deaths (correlations below are given for diagnoses and deaths, respectively). Private-only coverage (r = 0.33 and 0.21) and employer-provided coverage (r = 0.38 and 0.31) are positively correlated, while public-only coverage is negatively correlated with diagnoses (r = -0.15) and essentially uncorrelated with deaths (r = -0.01).

Median income and poverty percentage are also weakly correlated with both responses: median income positively (r = 0.35 and 0.28) and poverty percentage negatively (r = -0.22 and -0.08).

Race is weakly correlated with both responses as well. The percentage of the White population has a negative correlation (r = -0.08 and -0.18), the percentage of the Black population has a small positive correlation (r = 0.04 and 0.14), and the percentage of the Asian population has the largest positive correlation (r = 0.38 and 0.42).

The correlation coefficients confirm these observations.

Methods

Data Cleaning

As mentioned previously, data cleaning removed observations with missing values in pctprivatecoveragealone and pctemployed16_over and dropped pctsomecol18_24 entirely. Furthermore, incidencerate and target_deathrate were removed since they represent aspects of cancer frequency or mortality already captured by my response variables (avganncount and avgdeathsperyear). Similarly, log(avgdeathsperyear) was excluded from the model for log(avganncount), and vice versa. Before fitting the models, binnedinc, geography, statefips, and countyfips were removed since they are categorical labels or identifier codes rather than quantitative measures. Finally, since the histograms of avganncount and avgdeathsperyear were right-skewed, I used the log of both response variables to make the data better suited for linear regression.

Models Fit

Since the response variables are continuous, multiple linear regression was used. Backward stepwise selection simplified each model by starting with all predictors and removing, one at a time, the variables that did not meaningfully contribute to explaining variation in the response variable. This approach is useful when many predictors may be redundant or correlated with each other, and the goal is to have a more interpretable model without sacrificing predictive performance. Backward selection keeps the important variables while removing weak predictors, making the final model easier to interpret in the context of socioeconomic and demographic factors that are related to cancer outcomes.
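
The exact selection call used for these models is not reproduced here, so the following is only a minimal sketch of backward selection using base R's `step()` on simulated data. The data frame `toy`, its column names, and the coefficients are all hypothetical. `step()` removes terms by AIC, which may differ from the criterion actually used for the models in this dashboard.

```r
# Hypothetical data: two real predictors and two pure-noise predictors
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
toy$log_y <- 1 + 0.8 * toy$x1 - 0.5 * toy$x2 + rnorm(n, sd = 0.5)

# Start from the full model with every predictor
full_model <- lm(log_y ~ ., data = toy)

# Drop one term at a time while doing so lowers the AIC
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Predictors that survived the selection
kept_terms <- attr(terms(reduced_model), "term.labels")
print(kept_terms)
```

Because selection is driven by AIC rather than by individual p-values, a retained term is not guaranteed to be significant at the 0.05 level.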

Model 1 is based on the average annual cancer diagnoses and uses log(avganncount) as the response. Model 2 focuses on the average annual cancer deaths and uses log(avgdeathsperyear) as the response. The theoretical model is as follows:

\[ \log (y) = \beta_0 +\beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon \]

where \(p\) is the total number of predictors in the model and \(\varepsilon\) is the random error term. Model 1 had 21 predictors, and model 2 had 18 predictors.

Modeling

Average Cancer Diagnoses

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.4023 | 1.0743 | 2.2361 | 0.0254 |
| popest2015 | 0.0000 | 0.0000 | 16.7634 | 0.0000 |
| povertypercent | -0.0672 | 0.0084 | -8.0383 | 0.0000 |
| studypercap | 0.0001 | 0.0000 | 1.8729 | 0.0612 |
| medianagemale | -0.0890 | 0.0128 | -6.9649 | 0.0000 |
| medianagefemale | 0.0360 | 0.0137 | 2.6337 | 0.0085 |
| percentmarried | 0.0339 | 0.0103 | 3.2727 | 0.0011 |
| pctnohs18_24 | -0.0086 | 0.0034 | -2.5143 | 0.0120 |
| pcths18_24 | -0.0047 | 0.0029 | -1.6063 | 0.1083 |
| pcths25_over | -0.0281 | 0.0057 | -4.9206 | 0.0000 |
| pctbachdeg25_over | 0.0197 | 0.0090 | 2.1899 | 0.0286 |
| pctunemployed16_over | 0.0758 | 0.0098 | 7.7457 | 0.0000 |
| pctprivatecoverage | 0.0860 | 0.0104 | 8.2849 | 0.0000 |
| pctprivatecoveragealone | -0.0962 | 0.0136 | -7.0761 | 0.0000 |
| pctempprivcoverage | 0.0707 | 0.0075 | 9.4711 | 0.0000 |
| pctpubliccoveragealone | 0.0871 | 0.0091 | 9.5898 | 0.0000 |
| pctwhite | 0.0140 | 0.0035 | 3.9908 | 0.0001 |
| pctblack | 0.0114 | 0.0034 | 3.3974 | 0.0007 |
| pctasian | 0.0432 | 0.0116 | 3.7205 | 0.0002 |
| pctmarriedhouseholds | -0.0606 | 0.0108 | -5.6143 | 0.0000 |
| birthrate | -0.0323 | 0.0114 | -2.8218 | 0.0048 |
| avghouseholdsize | 0.4366 | 0.1962 | 2.2248 | 0.0262 |

Average Cancer Deaths

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.0079 | 0.8557 | 4.6836 | 0.0000 |
| popest2015 | 0.0000 | 0.0000 | 21.9905 | 0.0000 |
| povertypercent | -0.0804 | 0.0071 | -11.3639 | 0.0000 |
| medianagemale | -0.0680 | 0.0062 | -10.9857 | 0.0000 |
| pctnohs18_24 | -0.0156 | 0.0027 | -5.8639 | 0.0000 |
| pcths25_over | -0.0068 | 0.0045 | -1.5205 | 0.1285 |
| pctbachdeg25_over | 0.0525 | 0.0073 | 7.1964 | 0.0000 |
| pctemployed16_over | -0.0175 | 0.0046 | -3.7948 | 0.0002 |
| pctunemployed16_over | 0.0788 | 0.0083 | 9.4938 | 0.0000 |
| pctprivatecoverage | 0.0533 | 0.0083 | 6.3841 | 0.0000 |
| pctprivatecoveragealone | -0.1066 | 0.0112 | -9.5275 | 0.0000 |
| pctempprivcoverage | 0.0840 | 0.0060 | 14.0686 | 0.0000 |
| pctpubliccoveragealone | 0.0790 | 0.0073 | 10.8354 | 0.0000 |
| pctwhite | 0.0221 | 0.0028 | 7.8183 | 0.0000 |
| pctblack | 0.0213 | 0.0027 | 7.9458 | 0.0000 |
| pctasian | 0.0635 | 0.0094 | 6.7584 | 0.0000 |
| pctmarriedhouseholds | -0.0268 | 0.0049 | -5.4347 | 0.0000 |
| birthrate | -0.0570 | 0.0092 | -6.2091 | 0.0000 |
| avghouseholdsize | 0.2367 | 0.1311 | 1.8059 | 0.0711 |

Analysis

Average Cancer Diagnoses:

Out of my variables of interest, povertypercent, pctprivatecoveragealone, pctempprivcoverage, pctpubliccoveragealone, pctwhite, pctblack, and pctasian were significant predictors in the model, while medincome was not retained. Other variables included were median age, education status, marital status, and employment status. Interestingly, studypercap and pcths18_24 are not significant at the 0.05 level in this model. Backward selection nevertheless retained them, which suggests that they still contribute modestly to the overall fit alongside the other predictors.

The adjusted R-squared for this model is 0.4616, the F-statistic is 96.18, and the p-value is less than 2.2e-16. Therefore, we reject the null hypothesis that all slope coefficients are zero and have sufficient evidence to conclude that the predictors in this model predict the log of the average annual number of cancer diagnoses better than its mean alone.

A one-percentage-point increase in the poverty rate (\(β\) = -0.06722) is associated with an estimated decrease of approximately 6.72% in the expected average annual number of cancer diagnoses, holding all other predictors constant.
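
Percent changes for a log-transformed response are often read directly off the coefficient as \(100\beta\), which is a first-order approximation. As a worked check, the exact conversion for the poverty coefficient in model 1 is sketched below. It assumes only the fitted model form \(\log(y) = \beta_0 + \beta_1 x_1 + \cdots\) and the reported estimate.

\[ \%\Delta y = 100\left(e^{\beta} - 1\right) \approx 100\,\beta \quad \text{for small } |\beta| \]

\[ 100\left(e^{-0.06722} - 1\right) \approx 100\,(0.9350 - 1) = -6.50\% \]

The exact figure (about 6.5%) is close to the approximate one (6.72%). The gap widens as coefficients grow in magnitude, so the approximation is less accurate for the larger insurance coefficients.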

A one-percentage-point increase in the percent of people with private insurance only (\(β\) = -0.09620) is associated with an approximate 9.62% decrease in expected average cancer diagnoses, controlling for all other variables. On the other hand, a one-percentage-point increase in the percent of people with employer-provided health insurance (\(β\) = 0.0707) is associated with an approximate 7.07% increase in expected cancer diagnoses, holding other predictors constant. Similarly, a one-percentage-point increase in the percent of people covered only by public insurance (\(β\) = 0.08708) is associated with an approximate 8.71% increase in expected cancer diagnoses, controlling for the other predictors.

For the race variables, a one-percentage-point increase in the White population (\(β\) = 0.01401) is associated with a 1.40% increase in expected diagnoses, a one-percentage-point increase in the Black population (\(β\) = 0.01142) is associated with a 1.14% increase, and a one-percentage-point increase in the Asian population (\(β\) = 0.04316) is associated with a 4.32% increase, controlling for the other predictors.

Average Cancer Deaths:

Compared to model 1, model 2 drops studypercap, medianagefemale, percentmarried, and pcths18_24 and adds pctemployed16_over; all other predictors are the same. As in model 1, two retained predictors (here pcths25_over and avghouseholdsize) are not significant at the 0.05 level, but backward selection kept them, which suggests that they still contribute modestly to the overall fit.

The adjusted R-squared for this model is 0.5813, the F-statistic is 180.8, and the p-value is less than 2.2e-16. Therefore, we reject the null hypothesis that all slope coefficients are zero and have sufficient evidence to conclude that the predictors in this model predict the log of the average annual number of cancer deaths better than its mean alone.

A one-percentage-point increase in the poverty rate (\(β\) = -0.08044) is associated with an estimated decrease of approximately 8.04% in the expected average annual number of cancer deaths, holding all other predictors constant.

A one-percentage-point increase in the percent of people with private insurance only (\(β\) = -0.1066) is associated with an approximate 10.66% decrease in expected average cancer deaths, controlling for all other variables. On the other hand, a one-percentage-point increase in the percent of people with employer-provided health insurance (\(β\) = 0.08397) is associated with an approximate 8.40% increase in expected cancer deaths, holding other predictors constant. Similarly, a one-percentage-point increase in the percent of people covered only by public insurance (\(β\) = 0.07903) is associated with an approximate 7.90% increase in expected cancer deaths, controlling for the other predictors.

For the race variables, a one-percentage-point increase in the White population (\(β\) = 0.02205) is associated with a 2.21% increase in expected deaths, a one-percentage-point increase in the Black population (\(β\) = 0.02127) is associated with a 2.13% increase, and a one-percentage-point increase in the Asian population (\(β\) = 0.06347) is associated with a 6.35% increase, controlling for the other predictors.

Conclusions:

Both models show that socioeconomic and demographic factors, including poverty, insurance type, race, education, employment, and household structure, are important predictors of cancer diagnoses and deaths across U.S. counties. Insurance coverage variables and poverty in particular show strong and consistent relationships with cancer outcomes. In addition, model 2 (deaths) explains more variation than model 1 (diagnoses) based on adjusted R-squared.

Adequacy Checking

Diagnoses

Diagnostic Plots

Analysis

  • Residuals vs Fitted: The reference line is not flat and the points are not spread around it randomly. Therefore, it cannot be assumed there is a linear relationship.
  • Q-Q Residuals: The points mostly follow the 45 degree reference line, however, the points on the tails do deviate from it. Therefore, we can only assume the normality assumption of residuals is not severely violated.
  • Scale-Location: The reference line is not flat and the points are not evenly distributed around it. Therefore, the equal variance assumption is not met.
  • Residuals vs Leverage: No points fall outside the Cook’s distance contours, indicating that there are no highly influential observations.
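
The Residuals vs Leverage plot combines two ideas that are easy to conflate: leverage (how unusual an observation's predictor values are) and influence (how much the fit changes when the observation is removed, which is what Cook's distance measures). The sketch below uses hypothetical simulated data with one planted outlier to show how both can be computed with base R. It is an illustration only and does not use the cancer models.

```r
# Hypothetical data: 49 well-behaved points plus one planted outlier
set.seed(7)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)
x[50] <- 6     # far from the other x values, so high leverage
y[50] <- -5    # far from the trend, so a large residual

fit <- lm(y ~ x)

leverage <- hatvalues(fit)      # how unusual each x value is
cooks_d <- cooks.distance(fit)  # how much each point shifts the fitted model

most_influential <- unname(which.max(cooks_d))
cat("Most influential observation:", most_influential, "\n")
cat("Its leverage:", round(leverage[most_influential], 3), "\n")
cat("Its Cook's distance:", round(cooks_d[most_influential], 3), "\n")
```

A point needs both high leverage and a large residual to have a large Cook's distance, which is why the plot shows the two quantities together.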

VIFs

             popest2015          povertypercent             studypercap 
               1.380668                6.075658                1.030168 
          medianagemale         medianagefemale          percentmarried 
               9.176966               10.659084               10.717408 
           pctnohs18_24              pcths18_24            pcths25_over 
               1.597176                1.481820                3.336086 
      pctbachdeg25_over    pctunemployed16_over      pctprivatecoverage 
               4.859703                2.415433               25.685712 
pctprivatecoveragealone      pctempprivcoverage  pctpubliccoveragealone 
              39.383930               10.441768                6.545672 
               pctwhite                pctblack                pctasian 
               7.084240                5.130543                1.926693 
   pctmarriedhouseholds               birthrate        avghouseholdsize 
              10.560077                1.163180                4.869214 

Analysis

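Most of the VIFs are below or close to 10. However, pctprivatecoverage (25.7) and pctprivatecoveragealone (39.4) have much higher VIFs, and medianagefemale, percentmarried, pctempprivcoverage, and pctmarriedhouseholds slightly exceed 10. The high degree of multicollinearity among the insurance variables is to be expected, as the percentage of the population covered by only private insurance is included in the percentage of the population covered by private insurance. Because multicollinearity inflates standard errors, the individual coefficients for these overlapping variables should be interpreted with caution.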
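
For reference, the variance inflation factor for predictor \(j\) is defined from \(R_j^2\), the R-squared obtained by regressing that predictor on all the other predictors. The worked line below simply inverts that definition for the largest VIF reported for model 1. It is a back-of-the-envelope reading, not additional model output.

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} \quad\Longrightarrow\quad R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} \]

\[ R^2_{\text{pctprivatecoveragealone}} = 1 - \frac{1}{39.38} \approx 0.975 \]

In other words, about 97.5% of the variation in pctprivatecoveragealone is explained by the other predictors in model 1.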

Deaths

Diagnostic Plots

Analysis

  • Residuals vs Fitted: The reference line is not flat and the points are not spread around it randomly. Therefore, it cannot be assumed there is a linear relationship.
  • Q-Q Residuals: The points mostly follow the 45 degree reference line, however, the points on the tails do deviate from it. Therefore, we can only assume the normality assumption of residuals is not severely violated.
  • Scale-Location: The reference line is not flat and the points are not evenly distributed around it. Therefore, the equal variance assumption is not met.
  • Residuals vs Leverage: No points fall outside the Cook’s distance contours, indicating that there are no highly influential observations.

VIFs

             popest2015          povertypercent           medianagemale 
               1.372299                6.665621                3.293115 
           pctnohs18_24            pcths25_over       pctbachdeg25_over 
               1.502117                3.125778                4.876855 
     pctemployed16_over    pctunemployed16_over      pctprivatecoverage 
               4.553686                2.662661               25.410319 
pctprivatecoveragealone      pctempprivcoverage  pctpubliccoveragealone 
              40.806682               10.210974                6.465573 
               pctwhite                pctblack                pctasian 
               7.003089                4.985570                1.933623 
   pctmarriedhouseholds               birthrate        avghouseholdsize 
               3.384172                1.149018                3.326756 

Analysis

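Most of the VIFs are below 10. However, pctprivatecoverage (25.4) and pctprivatecoveragealone (40.8) have much higher VIFs, and pctempprivcoverage slightly exceeds 10 (10.2). The high degree of multicollinearity among the insurance variables is to be expected, as the percentage of the population covered by only private insurance is included in the percentage of the population covered by private insurance. Because multicollinearity inflates standard errors, the individual coefficients for these overlapping variables should be interpreted with caution.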

Conclusions

Discussion

Poverty: In both models, higher poverty rates are associated with lower expected cancer diagnoses and deaths. Because the response is log-transformed, this means that a one-percentage-point increase in poverty corresponds to an estimated 6-8% decrease in expected cancer cases or deaths. This suggests that counties with higher poverty may have lower access to screening, diagnostic services, or cancer reporting, leading to fewer recorded diagnoses and deaths rather than a true reduction in cancer burden.

Insurance: Higher private-insurance-only coverage is associated with fewer expected cancer diagnoses and deaths, whereas higher employer-provided private coverage and public-insurance-only coverage are both associated with more. These patterns indicate that higher cancer counts are associated with counties where residents rely on employer-based or public insurance. With estimated changes of roughly 7-11% per percentage point, insurance coverage appears to be one of the strongest predictors across both models.

Race: The proportions of White, Black, and Asian residents are positively associated with both cancer diagnoses and deaths, with the largest effects observed among counties with higher Asian populations. This means that counties with higher percentages of these groups tend to have higher expected cancer outcomes, holding other factors constant.

Education, age, employment, and household characteristics: Several additional socioeconomic variables, including education levels, unemployment, marital status, birthrate, and household size, also showed significant associations. These results indicate that cancer outcomes are influenced by a broad set of demographic and economic factors, not only insurance and poverty.

Model 1 vs. Model 2: The percentage-change effects are generally larger in the cancer-deaths model than in the diagnoses model. This suggests that socioeconomic and demographic conditions may play an even greater role in predicting cancer mortality than diagnoses.

Addressing Research Questions

(1) Geographic patterns in cancer death rates and deaths: The maps revealed clear geographic differences in cancer outcomes across U.S. states. Target death rates were lowest in Western states and highest in Appalachia and parts of the South. For the average number of deaths, states in the Midwest and Mountain West generally showed the lowest values, while several Northeastern and Western states, including Connecticut, Massachusetts, Arizona, and California, showed substantially higher values. Because the number of deaths is a raw count that reflects population size, it should not be compared directly against the target rate. Still, these patterns suggest that state-level differences, such as healthcare access, population age structure, reporting practices, or environmental exposures, may contribute to geographic disparities in cancer outcomes.

(2) Strongest predictors of cancer diagnoses: Insurance coverage variables (public insurance only, employer-provided insurance, and private insurance alone), poverty, race, and several socioeconomic indicators (education levels, unemployment, marital status, and birth rate) were the most strongly associated with average annual cancer diagnoses across U.S. counties.

(3) Strongest predictors of cancer deaths: The same categories of predictors remained important, but the effect sizes were generally larger in the deaths model. Poverty, insurance type, unemployment, race, marital status, and birthrate showed the strongest associations with cancer mortality.

Limitations

Some of the important assumptions for linear regression were not fully met in my models. Therefore, another modeling approach might better explain the relationships among the variables in this data set, and the results should be viewed as exploratory associations rather than exact predictions. However, the large sample size and the log transformation help stabilize the estimates, so the overall patterns across the predictors are still useful to interpret.

Additionally, I deleted an entire variable from the data set for this analysis due to it having many missing values. It is possible that keeping this variable or handling the missing values differently might have led to a different model that better predicted cancer diagnoses and cancer deaths.

Future Directions

This analysis suggests that differences in insurance coverage, poverty, and race, as well as other non-biological factors, are associated with the number of cancer diagnoses and deaths in the US. Future studies that focus on understanding the exact reasons and mechanisms behind these associations would be useful in addressing barriers to healthcare.

Other socioeconomic and demographic factors that were not present in this data set can also be explored. For example, data on the number of people who went through with treatment could correlate with the average number of cancer deaths per year. Furthermore, it would be important to know whether the number of people who went through treatment was related to the cost or availability of the treatment.

Additionally, although the predictors were significant, the moderate R-squared value indicates that there are other variables not in the data that could correlate with cancer deaths and diagnoses. Genetics and other biological factors could be significantly impacting health outcomes relating to cancer.

Author

About the Author

My name is Alexa Neal and I am currently a senior at the University of Dayton. I am pursuing a Bachelor of Science in Premedicine with minors in Data Analytics, Medicine and Society, and Neuroscience. My projected graduation is May 2026, and I will be attending medical school in the Fall of 2026.

As a Premedicine major and Medicine and Society minor, many of my classes revolve around the social, cultural, and environmental context of health and disease. For this project, I wanted to pick a data set that would allow me to explore these areas.

AI Acknowledgement

AI was used to understand and correct errors that occurred when coding. It was also used for code to clean up the presentation of the dashboard, for example, making a table to display the model coefficients and labeling states in the maps section.

Works Cited

Park, S., & Fung, V. (2025). Health Care Affordability Problems by Income Level and Subsidy Eligibility in Medicare. JAMA Network Open, 8(9), e2532862. https://doi.org/10.1001/jamanetworkopen.2025.32862

---
title: "Cancer Analysis"
output: 
  flexdashboard::flex_dashboard:
    includes:
      in_header: header.html
    theme:
      version: 4
      bootswatch: lumen
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(DT)
library(tidyverse)
library(pacman)
library(dplyr)
library(DataExplorer)
library(car)
library(leaps)
library(MASS)
library(plotly)
library(usmap)
library(knitr)
library(broom)
library(sf)
library(ggplot2)
```



Introduction
===

Column {data-width=450}
---
### Background
**Title:** Socioeconomic and Demographic Impacts on Average Cancer Diagnosis and Deaths per Year in the US

**Author:** Alexa Neal

In this project, I wanted to explore how various non-biological factors are related to the average number of cancer cases diagnosed and the average number of cancer deaths in the US. [One study](https://pmc.ncbi.nlm.nih.gov/articles/PMC12455370/) found that individuals on Medicare with low incomes had greater challenges in affording their healthcare (Park & Fung, 2025), highlighting that factors other than genetics and biology can impact health outcomes. This also showcases how certain populations are vulnerable to worse outcomes due to lack of access to healthcare. Identifying these factors can be useful in understanding barriers to important medical resources and how to improve access. 

### Research Questions
To explore these concepts, I found two data sets on [Kaggle](https://www.kaggle.com/datasets/varunraskar/cancer-regression), one containing health-related information and the other containing demographic information, and combined them. I utilized exploratory data analysis and multiple linear regression models to understand what factors correlate with average number of cancer cases diagnosed and the average number of cancer deaths per year. My research questions were: 

1. How do the cancer target death rate and average number of cancer deaths vary geographically across U.S. states?
2. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer diagnoses per year across U.S. counties?
3. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer deaths per year across U.S. counties?


Column {.tabset data-width=550}
---
### Variables of Interest

- avganncount: Average number of cancer cases diagnosed annually

- avgdeathsperyear: Average number of deaths due to cancer per year

- target_deathrate: Target death rate due to cancer 

- medincome: Median income in the region

- povertypercent: Percentage of population below the poverty line

- pctprivatecoveragealone: Percentage of population covered by private health insurance alone

- pctempprivcoverage: Percentage of population covered by employer-provided private health insurance

- pctpubliccoveragealone: Percentage of population covered by public health insurance only

- pctwhite: Percentage of White population

- pctblack: Percentage of Black population

- pctasian: Percentage of Asian population


### Other Variables

- incidencerate: Incidence rate of cancer

- popest2015: Estimated population in 2015

- studypercap: Per capita number of cancer-related clinical trials conducted

- binnedinc: Binned median income

- medianage: Median age in the region

- pctpubliccoverage: Percentage of population covered by public health insurance

- pctotherrace: Percentage of population belonging to other races

- pctmarriedhouseholds: Percentage of married households

- birthrate: Birth rate in the region

- statefips: The FIPS code representing the state

- countyfips: The FIPS code representing the county or census area within the state

- avghouseholdsize: The average household size in the region

- geography: The geographical location, typically represented as the county or census area name followed by the state name

### Data Cleaning

Before performing any analysis, I used plot_intro() to get an overview of the data set and found that there were missing values. I then used plot_missing() to see which columns contained them. To clean the data, I removed observations with missing values in pctprivatecoveragealone and pctemployed16_over, since I wanted to keep both variables in the analysis. I also removed the pctsomecol18_24 column entirely due to its large number of missing values. 

#### Introduction to Data

```{r cleaning}
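## Read the two Kaggle files and join them.
## NOTE: these absolute paths are specific to the author's machine; adjust them to where the CSVs live.
household <- read.csv("C:/Users/write/OneDrive/Desktop/school/regression files/avg-household-size.csv")
cancer_reg <- read.csv("C:/Users/write/OneDrive/Desktop/school/regression files/cancer_reg (1).csv")
## No `by` is given, so left_join() matches on every column name the two data frames share.
cancer <- left_join(cancer_reg, household)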

plot_intro(cancer)
```


#### Missing Value Distribution

```{r missing value distribution}

plot_missing(cancer)

cancer <- cancer %>% 
  drop_na(pctprivatecoveragealone, pctemployed16_over) %>%  
  dplyr::select(-c(pctsomecol18_24))
```


Maps
===

Column {.tabset data-width=700}
---

### Target Death Rate

```{r target death rate map}
cancer_target_map <- cancer %>%
  group_by(statefips) %>%
  summarize(target_deathrate = mean(target_deathrate, na.rm = TRUE))

cancer_target_map$fips <- sprintf("%02d", cancer_target_map$statefips)

p1 <- plot_usmap(data = cancer_target_map, values = "target_deathrate") +
  scale_fill_viridis_c(option="D", direction=-1, name="Avg. Target Death Rate")

map_sf <- usmap::us_map("states") %>%
  sf::st_as_sf()

centroids <- map_sf %>%
  mutate(centroid = sf::st_centroid(geom)) %>%
  mutate(
    x = sf::st_coordinates(centroid)[,1],
    y = sf::st_coordinates(centroid)[,2]
  ) %>%
  dplyr::select(fips, x, y)

state_lookup <- usmap::fips_info() %>%
  filter(!abbr %in% c("DC", "PR"))  

centroids_labeled <- centroids %>%
  left_join(cancer_target_map, by = "fips") %>%
  left_join(state_lookup, by = "fips")

p1_labeled <- p1 +
  geom_text(
    data = centroids_labeled,
    aes(x = x, y = y, label = abbr),  # change to `full` for full state names
    size = 3,
    fontface = "bold",
    inherit.aes = FALSE
  )

ggplotly(p1_labeled, tooltip = c("target_deathrate", "full"), width=790, height=552)

```

### Average Deaths per Year

```{r average deaths map}
cancer_actualdeaths_map <- cancer %>%
  group_by(statefips) %>%
  summarize(avgdeathsperyear = mean(avgdeathsperyear, na.rm = TRUE))

cancer_actualdeaths_map$fips <- sprintf("%02d", cancer_actualdeaths_map$statefips)

p2 <- plot_usmap(data = cancer_actualdeaths_map, values = "avgdeathsperyear") +
  scale_fill_viridis_c(option="D", direction=-1, name="Avg. Deaths per Year")

map_sf <- usmap::us_map("states") %>%
  sf::st_as_sf()

centroids <- map_sf %>%
  mutate(centroid = sf::st_centroid(geom)) %>%
  mutate(
    x = sf::st_coordinates(centroid)[,1],
    y = sf::st_coordinates(centroid)[,2]
  ) %>%
  dplyr::select(fips, x, y)

state_lookup <- usmap::fips_info() %>%
  filter(!abbr %in% c("DC", "PR"))  

centroids_labeled2 <- centroids %>%
  left_join(cancer_actualdeaths_map, by = "fips") %>%
  left_join(state_lookup, by = "fips")

p2_labeled <- p2 +
  geom_text(
    data = centroids_labeled2,
    aes(x = x, y = y, label = abbr),  # change to `full` for full state names
    size = 3,
    fontface = "bold",
    inherit.aes = FALSE
  )

ggplotly(p2_labeled, tooltip = c("avgdeathsperyear", "full"), width=790, height=552)
```


Column {data-width=300}
---

### Analysis
To explore whether cancer outcomes differ across regions of the United States, I computed state-level averages of the target death rate and the actual number of cancer deaths and visualized them using choropleth maps. These descriptive maps help reveal broad geographic patterns in cancer outcomes before moving to the regression analysis.

The maps reveal variations in the target death rate and average number of cancer deaths based on geographical location. The target death rate is the desired goal for deaths in an area, and is an age-adjusted benchmark rate per 100,000. The lowest target death rates are in Western states and the highest rates are in parts of Appalachia (West Virginia, Kentucky, Tennessee, Mississippi, etc.).

The second map shows the average number of cancer deaths per year, which is a raw count (averaged over the counties in each state) and therefore reflects population size. States with larger populations (such as California, Florida, and New York) naturally have higher counts, whereas states with smaller populations (such as Alaska and North Dakota) have lower counts. Since the target death rate is a standardized value and the average number of deaths is a raw count, the two variables should not be directly compared. However, the geographical differences indicate that there may be socioeconomic and demographic factors that explain why some states have larger cancer burdens than others, motivating the regression analysis later on.
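
The difference between a raw count and a standardized rate can be made concrete by converting counts to deaths per 100,000 residents. The sketch below uses invented numbers for two hypothetical regions. Dividing `avgdeathsperyear` by `popest2015` is only an assumption about how a crude rate could be formed; it is not how `target_deathrate` was computed in the source data.

```r
# Invented values for two hypothetical regions (not taken from the data set)
toy <- data.frame(
  region = c("Large", "Small"),
  avgdeathsperyear = c(5000, 60),
  popest2015 = c(3000000, 25000)
)

# Crude rate: deaths per 100,000 residents
toy$crude_rate <- toy$avgdeathsperyear / toy$popest2015 * 1e5

toy
```

In this toy example the smaller region has far fewer deaths but the higher rate, which is why a count map and a rate map can rank states very differently.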

EDA
===

Column {.tabset data-width=600}
---

### Diagnosis

```{r diagnosis histogram}
ggplot(cancer, aes(x = avganncount)) +
  geom_histogram(fill = "#2A4C65") +
  labs(title = "Distribution of Cancer Diagnoses",
       x = "Average Diagnoses Per Year", y = "Frequency")
```

### Deaths

```{r death histogram}
ggplot(cancer, aes(x = avgdeathsperyear)) +
  geom_histogram(fill = "#2A4C65") +
  labs(title = "Distribution of Cancer Deaths",
       x = "Average Deaths Per Year", y = "Frequency")
```

### Insurance

#### Average Diagnoses

```{r insurance diagnosis}
cancer$log_avganncount <- log(cancer$avganncount)

pairs(~log_avganncount + pctprivatecoveragealone
      + pctempprivcoverage + pctpubliccoveragealone, data = cancer)
```

#### Average Deaths

```{r insurance death}
cancer$log_avgdeathsperyear <- log(cancer$avgdeathsperyear)

pairs(~log_avgdeathsperyear + pctprivatecoveragealone
      + pctempprivcoverage + pctpubliccoveragealone, data = cancer)
```

### Income

#### Average Diagnoses

```{r income diagnoses}
pairs(~log_avganncount + medincome + povertypercent, data = cancer)
```

#### Average Deaths

```{r income death}
pairs(~log_avgdeathsperyear + medincome + povertypercent, data = cancer)
```

### Race

#### Average Diagnoses

```{r race diagnoses}
pairs(~log_avganncount + pctwhite + pctblack + pctasian,
      data = cancer)
```

#### Average Deaths

```{r race death}
pairs(~log_avgdeathsperyear + pctwhite + pctblack + pctasian,
      data = cancer)
```


### Correlation

#### Average Diagnoses

```{r diagnoses rho}
rho1 <- cor(cancer[,c("log_avganncount", "pctprivatecoveragealone", "pctempprivcoverage", "pctpubliccoveragealone", "medincome", "povertypercent", "pctwhite", "pctblack", "pctasian")])
         
rho1 <- round(rho1, 2)

knitr::kable(rho1, align = "c")
```

#### Average Deaths

```{r deaths rho}
rho2 <- cor(cancer[,c("log_avgdeathsperyear", "pctprivatecoveragealone", "pctempprivcoverage", "pctpubliccoveragealone", "medincome", "povertypercent", "pctwhite",
"pctblack", "pctasian")])
         
rho2 <- round(rho2, 2)

knitr::kable(rho2, align = "c")
```

Column {data-width=400}
---
### Analysis

The distributions of average annual cancer diagnoses and average annual cancer deaths are both right-skewed, which suggests that a log transformation will be useful. For the rest of this analysis, I used the log of average annual cancer diagnoses and the log of average annual cancer deaths.

The insurance variables are only weakly correlated with both the log of cancer diagnoses and the log of cancer deaths: private and employee-provided coverage have a positive correlation, and public coverage has a negative correlation.

Median income and poverty percentage are also weakly correlated with both responses, with the former having a positive correlation and the latter a negative correlation.

The race variables are likewise weakly correlated with both responses, with the percentage of the White population having a negative correlation and the percentages of the Black and Asian populations having a positive correlation.

The correlation coefficients confirm these observations. 
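
As a rough illustration of why the log transformation helps, the sketch below simulates a right-skewed variable and compares a moment-based skewness before and after taking logs. The simulated data and the helper `skewness_simple()` are hypothetical and only stand in for `avganncount`.

```r
set.seed(1)

# Simulated right-skewed values (log-normal), standing in for avganncount
x <- rlnorm(1000, meanlog = 5, sdlog = 1)

# Simple moment-based sample skewness (hypothetical helper)
skewness_simple <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skewness_simple(x)       # clearly positive: long right tail
skewness_simple(log(x))  # near zero after the transformation
```

Note that `log()` returns `-Inf` for zero, so this transformation only applies cleanly to strictly positive values.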


Methods
===

Column {data-width=500}
---
### Data Cleaning
As mentioned previously, data cleaning was performed to remove observations with missing values in pctprivatecoveragealone and pctemployed16_over, and to remove pctsomecol18_24 entirely. Furthermore, incidencerate and target_deathrate were removed since they represent aspects of cancer frequency or mortality already captured by my response variables (avganncount and avgdeathsperyear). Similarly, log(avgdeathsperyear) was removed from the model for log(avganncount), and vice versa. Before fitting the models, binnedinc, geography, statefips, and countyfips were removed since they are categorical labels or identifier codes rather than quantitative measurements. Finally, since the histograms of avganncount and avgdeathsperyear were right-skewed, I used the log of both response variables to make the data better suited for linear regression.

### Models Fit 
Since both response variables are continuous, multiple linear regression was used. Backward stepwise selection (based on AIC) was used to simplify the model by starting with all predictors and removing, one at a time, variables that did not meaningfully contribute to explaining variation in the response variable. This approach is useful when many predictors may be redundant or correlated with each other, and the goal is to have a more interpretable model without sacrificing predictive performance. Backward selection keeps the important variables while removing weak predictors, making the final model easier to interpret in the context of socioeconomic and demographic factors that are related to cancer outcomes.
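
A minimal sketch of backward selection with `MASS::stepAIC()`, using the built-in `mtcars` data as a stand-in for the cancer data (the object names `full_toy` and `step_toy` are hypothetical). `stepAIC()` drops a term when doing so lowers the AIC, so the terms it retains are not guaranteed to have small p-values.

```r
library(MASS)

# Full model with every other column of mtcars as a predictor
full_toy <- lm(mpg ~ ., data = mtcars)

# Backward elimination: at each step, drop the term whose removal lowers AIC most
step_toy <- stepAIC(full_toy, direction = "backward", trace = FALSE)

# Compare the number of coefficients and the AIC before and after selection
length(coef(full_toy))
length(coef(step_toy))
AIC(full_toy)
AIC(step_toy)
```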

Model 1 is based on the average annual cancer diagnoses and uses log(avganncount) as the response. Model 2 focuses on the average annual cancer deaths and uses log(avgdeathsperyear) as the response. The theoretical model is as follows:

\[
\log (y) = \beta_0 +\beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon
\]

with $p$ being the total number of predictors in the model and $\varepsilon$ the random error term. Model 1 had 21 predictors, and model 2 had 18 predictors.

Modeling 
===

Column {.tabset data-width=600}
---

### Average Cancer Diagnoses

```{r}
cancer_diagnoses <- cancer %>%  
  dplyr::select(-c(binnedinc, geography, statefips, countyfips, 
                   avganncount, avgdeathsperyear, target_deathrate,
                   incidencerate, log_avgdeathsperyear))

full.cancer.diagnoses <- lm(log_avganncount ~ ., data = cancer_diagnoses)
fit.backward.diagnoses <- stepAIC(full.cancer.diagnoses, direction = "backward", trace = FALSE)

model1 <- summary(fit.backward.diagnoses)

tidy(model1) %>%
  kable(digits = 4)
```

### Average Cancer Deaths

```{r}
cancer_deaths <- cancer %>%  
  dplyr::select(-c(binnedinc, geography, statefips, countyfips, 
            avganncount, avgdeathsperyear, target_deathrate,
            incidencerate, log_avganncount))

full.cancer.deaths <- lm(log_avgdeathsperyear ~ ., data = cancer_deaths)
fit.backward.deaths <- stepAIC(full.cancer.deaths, direction = "backward", trace = FALSE)

model2 <- summary(fit.backward.deaths)

tidy(model2) %>%
  kable(digits = 4)
```


Column {data-width=400}
---
### Analysis
**Average Cancer Diagnoses:**

Out of my variables of interest, povertypercent, pctprivatecoveragealone, pctempprivcoverage, pctpubliccoveragealone, pctwhite, pctblack, and pctasian were significant predictors in the model. Other variables retained were related to median age, education status, marital status, and employment status. Interestingly, studypercap and pcths18_24 are not individually significant in this model. They remain because backward selection with stepAIC() keeps a term whenever dropping it would not lower the AIC, which does not require the term's own p-value to be small.

The adjusted R-squared for this model is 0.4616, the F-statistic is 96.18, and the p-value is less than 2.2e-16. Therefore, we reject the null hypothesis that all slope coefficients are zero and have sufficient evidence to conclude that the predictors in this model predict the average annual number of cancer diagnoses better than the mean of cancer diagnoses alone.

A one-percentage-point increase in the poverty rate ($\beta$ = -0.06722) is associated with an approximate 6.72% decrease in the expected average annual number of cancer diagnoses, holding all other predictors constant.

A one-percentage-point increase in the percent of people with private insurance only ($\beta$ = -0.09620) is associated with an approximate 9.62% decrease in expected average cancer diagnoses, controlling for all other variables. On the other hand, a one-percentage-point increase in the percent of people with employee-provided health insurance ($\beta$ = 0.07372) is associated with an approximate 7.37% increase in expected cancer diagnoses, holding other predictors constant. Similarly, a one-percentage-point increase in the percent of people covered only by public insurance ($\beta$ = 0.08708) is associated with an approximate 8.71% increase in expected cancer diagnoses, controlling for the other predictors.

For the race variables, a one-percentage-point increase in the White population ($\beta$ = 0.01401) is associated with an approximate 1.40% increase in expected diagnoses, a one-percentage-point increase in the Black population ($\beta$ = 0.01142) is associated with an approximate 1.14% increase, and a one-percentage-point increase in the Asian population ($\beta$ = 0.04316) is associated with an approximate 4.32% increase, controlling for the other predictors.

**Average Cancer Deaths:**

Compared to model 1, model 2 has all of the same predictors except for studypercap, medianagefemale, and percentmarried. As in model 1, some retained variables (here pcths25_over and avghouseholdsize) are not individually significant. They remain because backward selection with stepAIC() keeps a term whenever dropping it would not lower the AIC, which does not require the term's own p-value to be small.

The adjusted R-squared for this model is 0.5813, the F-statistic is 180.8, and the p-value is less than 2.2e-16. Therefore, we reject the null hypothesis that all slope coefficients are zero and have sufficient evidence to conclude that the predictors in this model predict the average annual number of cancer deaths better than the mean of cancer deaths alone.

A one-percentage-point increase in the poverty rate ($\beta$ = -0.08044) is associated with an approximate 8.04% decrease in the expected average annual number of cancer deaths, holding all other predictors constant.

A one-percentage-point increase in the percent of people with private insurance only ($\beta$ = -0.1066) is associated with an approximate 10.66% decrease in expected average cancer deaths, controlling for all other variables. On the other hand, a one-percentage-point increase in the percent of people with employee-provided health insurance ($\beta$ = 0.08397) is associated with an approximate 8.40% increase in expected cancer deaths, holding other predictors constant. Similarly, a one-percentage-point increase in the percent of people covered only by public insurance ($\beta$ = 0.07903) is associated with an approximate 7.90% increase in expected cancer deaths, controlling for the other predictors.

For the race variables, a one-percentage-point increase in the White population ($\beta$ = 0.02205) is associated with an approximate 2.21% increase in expected deaths, a one-percentage-point increase in the Black population ($\beta$ = 0.02127) is associated with an approximate 2.13% increase, and a one-percentage-point increase in the Asian population ($\beta$ = 0.06347) is associated with an approximate 6.35% increase, controlling for the other predictors.
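
Because the response is on the log scale, reading a coefficient $\beta$ directly as a percentage is an approximation; the exact multiplicative change for a one-unit increase is $e^{\beta} - 1$. The sketch below applies both calculations to the deaths-model coefficients reported above. The names `beta_deaths` and `pct_change` are introduced only for illustration.

```r
# Coefficients reported above for the deaths model (response: log_avgdeathsperyear)
beta_deaths <- c(
  povertypercent          = -0.08044,
  pctprivatecoveragealone = -0.1066,
  pctempprivcoverage      =  0.08397,
  pctpubliccoveragealone  =  0.07903,
  pctwhite                =  0.02205,
  pctblack                =  0.02127,
  pctasian                =  0.06347
)

# Approximate percent change (100 * beta) versus exact percent change
pct_change <- data.frame(
  approx_pct = 100 * beta_deaths,
  exact_pct  = 100 * (exp(beta_deaths) - 1)
)

round(pct_change, 2)
```

The two columns agree closely for small coefficients and drift apart as the coefficient grows in magnitude, so the exact form is the safer one to report for the larger insurance effects.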

**Conclusions:**

Both models show that socioeconomic and demographic factors, including poverty, insurance type, race, education, employment, and household structure, are important predictors of cancer diagnoses and deaths across U.S. counties. Insurance coverage variables and poverty in particular show strong and consistent relationships with cancer outcomes. In addition, model 2 (deaths) explains more variation than model 1 (diagnoses) based on adjusted R-squared.


Adequacy Checking
===
Column {.tabset data-width=500}
---
### Diagnoses

#### Diagnostic Plots

```{r}
par(mfrow=c(2,2))
plot(fit.backward.diagnoses)
```

#### Analysis

- Residuals vs Fitted: The reference line is not flat and the points are not spread randomly around it. Therefore, the linearity assumption is questionable.
- Q-Q Residuals: The points mostly follow the 45-degree reference line, but the points in the tails deviate from it. Therefore, we can only say that the normality assumption for the residuals is not severely violated.
- Scale-Location: The reference line is not flat and the points are not evenly distributed around it. Therefore, the equal variance assumption is not met.
- Residuals vs Leverage: There are no points outside the Cook's distance contours, indicating that there are no highly influential observations.
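
Leverage and influence are related but distinct: leverage (hat values) measures how unusual an observation's predictor values are, while Cook's distance measures how much the fitted model changes when that observation is dropped. A minimal sketch on the built-in `mtcars` data follows; the cutoffs used (twice the mean hat value, Cook's distance above 0.5) are common rules of thumb rather than fixed standards, and `fit_toy` is a hypothetical stand-in model.

```r
# Small stand-in model on built-in data
fit_toy <- lm(mpg ~ wt + hp, data = mtcars)

lev  <- hatvalues(fit_toy)       # leverage of each observation
cook <- cooks.distance(fit_toy)  # influence of each observation

# Rule-of-thumb flags (conventions, not hard thresholds)
high_leverage  <- names(lev)[lev > 2 * mean(lev)]
high_influence <- names(cook)[cook > 0.5]

high_leverage
high_influence
```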

### VIFs

```{r}
vif(fit.backward.diagnoses)
```

#### Analysis

Most of the VIFs have a value of 10 or lower; however, pctprivatecoverage and pctprivatecoveragealone have high VIFs. This high degree of multicollinearity is to be expected, as the percentage of the population covered by only private insurance is included in the percentage of the population covered by private insurance.
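
The VIF for a predictor is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the other predictors. The sketch below computes it by hand on `mtcars`. The helper `vif_manual()` is hypothetical and is meant to mirror what `vif()` reports for a model with only numeric main effects.

```r
# Hypothetical helper: VIF of each column in a data frame of numeric predictors
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    # Regress predictor v on all remaining predictors
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v),
                     data = predictors))$r.squared
    1 / (1 - r2)
  })
}

# disp and cyl are strongly correlated in mtcars, which inflates their VIFs
vif_toy <- vif_manual(mtcars[, c("wt", "hp", "disp", "cyl")])
vif_toy
```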

Column {.tabset data-width=500}
---

### Deaths

#### Diagnostic Plots

```{r}
par(mfrow=c(2,2))
plot(fit.backward.deaths)
```

#### Analysis

- Residuals vs Fitted: The reference line is not flat and the points are not spread randomly around it. Therefore, the linearity assumption is questionable.
- Q-Q Residuals: The points mostly follow the 45-degree reference line, but the points in the tails deviate from it. Therefore, we can only say that the normality assumption for the residuals is not severely violated.
- Scale-Location: The reference line is not flat and the points are not evenly distributed around it. Therefore, the equal variance assumption is not met.
- Residuals vs Leverage: There are no points outside the Cook's distance contours, indicating that there are no highly influential observations.

### VIFs

```{r}
vif(fit.backward.deaths)
```

#### Analysis
Most of the VIFs have a value of 10 or lower; however, pctprivatecoverage and pctprivatecoveragealone have high VIFs. This high degree of multicollinearity is to be expected, as the percentage of the population covered by only private insurance is included in the percentage of the population covered by private insurance.

Conclusions
===
Column {.tabset data-width=500}
---
### Discussion
**Poverty:** In both models, higher poverty rates are associated with lower expected cancer diagnoses and deaths. Because the response is log-transformed, this means that a one-percentage-point increase in poverty corresponds to an estimated 6-8% decrease in expected cancer cases or deaths. This suggests that counties with higher poverty may have lower access to screening, diagnostic services, or cancer reporting, leading to fewer recorded diagnoses and deaths rather than a true reduction in cancer burden.

**Insurance:** Higher private-insurance-only coverage is associated with lower expected cancer diagnoses and deaths, whereas higher employee-provided private insurance and public-insurance-only coverage are both associated with higher expected values. These patterns indicate that higher cancer counts are more strongly associated with counties where residents rely on employer-based or public insurance. Judging by the size of the estimated percentage changes, insurance coverage appears to be one of the strongest predictors across both models.

**Race:** The percentages of White, Black, and Asian residents are positively associated with both cancer diagnoses and deaths, with the largest coefficient for the percentage of Asian residents. This means that counties with higher percentages of these groups tend to have higher expected cancer diagnoses and deaths, holding other factors constant.

**Education, age, employment, and household characteristics:**
Several additional socioeconomic variables, including education levels, unemployment, marital status, birthrate, and household size, also showed significant associations. These results indicate that cancer outcomes are associated with a broad set of demographic and economic factors, not only insurance and poverty.

**Model 1 vs. Model 2:** The percentage-change effects are generally larger in the cancer-deaths model than in the diagnoses model. This suggests that socioeconomic and demographic conditions may play an even greater role in predicting cancer mortality than diagnoses.

### Addressing Research Questions

**(1) Geographic patterns in the target death rate and cancer deaths:**
The maps revealed clear geographic differences in cancer outcomes across U.S. states. The target death rate was lowest in Western states and highest in parts of Appalachia, while the average number of cancer deaths per year was highest in populous states such as California, Florida, and New York. Because the first is a standardized rate and the second is a raw count, the two maps are not directly comparable. These patterns suggest that state-level differences, such as healthcare access, population age structure, reporting practices, or environmental exposures, may contribute to geographic disparities in cancer outcomes.

**(2) Strongest predictors of cancer diagnoses:**
Insurance coverage variables (public insurance only, employer-provided insurance, and private insurance alone), poverty, race, and several socioeconomic indicators (education levels, unemployment, marital status, and birth rate) were the most strongly associated with average annual cancer diagnoses across U.S. counties.

**(3) Strongest predictors of cancer deaths:**
The same categories of predictors remained important, but the effect sizes were generally larger in the deaths model. Poverty, insurance type, unemployment, race, marital status, and birthrate showed the strongest associations with cancer mortality.


  

Column {data-width=500}
---

### Limitations
Some of the important assumptions for linear regression were not fully met in my models. Therefore, another modeling approach might better explain the relationships among the variables in this data set, and the results should be viewed as exploratory associations rather than exact predictions. However, the large sample size and the log transformation help stabilize the estimates, so the overall patterns across the predictors are still useful to interpret.

Additionally, I deleted an entire variable from the data set for this analysis due to it having many missing values. It is possible that keeping this variable or handling the missing values differently might have led to a different model that better predicted cancer diagnoses and cancer deaths.

### Future Directions
This analysis suggests that differences in insurance coverage, income, and race, as well as other non-biological factors, are associated with the number of cancer diagnoses and deaths in the US. Future studies that focus on understanding the reasons and mechanisms behind these associations would be useful in addressing barriers to healthcare.

Other socioeconomic and demographic factors that were not present in this data set can also be explored. For example, the number of people who went through with treatment might correlate with the average number of cancer deaths per year. Furthermore, it would be important to know whether the number of people who went through with treatment was related to the cost or availability of the treatment.

Additionally, although the predictors were significant, the moderate adjusted R-squared values indicate that there are other variables not in the data that could correlate with cancer deaths and diagnoses. Genetics and other biological factors could be significantly impacting health outcomes relating to cancer.

Author
===

Column {data-width=500}
---
### About the Author
My name is Alexa Neal and I am a current senior at the University of Dayton. I am pursuing a Bachelor of Science in Premedicine, and minors in Data Analytics, Medicine and Society, and Neuroscience. My projected graduation is May 2026, and I will be attending medical school in the Fall of 2026.

As a Premedicine major and Medicine and Society minor, many of my classes revolve around the social, cultural, and environmental context of health and disease. For this project, I wanted to pick a data set that would allow me to explore these areas.

### AI Acknowledgement
AI was used to understand and correct errors that occurred when coding. It was also used for code to clean up the presentation of the dashboard, for example, making a table to display the model coefficients and labeling states in the maps section. 

### Works Cited
Park, S., & Fung, V. (2025). Health Care Affordability Problems by Income Level and Subsidy Eligibility in Medicare. JAMA Network Open, 8(9), e2532862. https://doi.org/10.1001/jamanetworkopen.2025.32862

Column {data-width=500}
---
###

```{r headshot}
knitr::include_graphics("C:/Users/write/OneDrive/Desktop/photos/headshot.JPG")
```