This blog post analyzes, at the county level within California, whether tree canopy cover percentage affects the likelihood of dying after contracting COVID-19.
What impact does the percentage of tree canopy cover have on the rate of COVID deaths per confirmed positive case in California?
This is an important and interesting environmental justice question because both low tree canopy cover and COVID-19 have disproportionately affected historically marginalized populations throughout the United States. Additionally, both are linked to respiratory and cardiovascular disease. There is evidence that areas with more urban tree canopy have better public health outcomes (6), including lower rates of asthma, stroke, and cardiac disease. There is also evidence that individuals with existing respiratory and/or cardiovascular disease are not only more likely to contract COVID-19, but also more likely to become seriously ill or die (8).
The tree canopy data used for this analysis is publicly available from the Public Health Alliance of Southern California (2), which publishes the California Healthy Places Index. The data is available in CSV format, with a canopy cover percentage for each census tract within California. The website was last updated in April 2021, but the tree canopy data is from 2011.
The COVID-19 data used in this analysis is publicly available on the LA Times DataDesk GitHub repository (1). This data is collected using scrapers written in Python and Jupyter notebooks, scheduled and run via GitHub Actions, and archived using git. The scrapers collect data from the California Department of Public Health and other government agencies. The data is at a county-level spatial resolution and includes daily counts of both confirmed cases and deaths from February 1, 2020 to the present. This analysis used the daily counts from February 1, 2020 through November 22, 2021. One potential source of bias is that confirmed cases are counted from positive test results, so individuals who contracted COVID-19 but were never tested are not included. Given the high rate of asymptomatic cases (9), a large number of infections are likely missing from the data.
The geographic data used in this analysis includes California county borders and the U.S. Census regions, which divide the state into 10 regions. The county geographies were downloaded as a shapefile from the LA Times DataDesk GitHub repository (3). The U.S. Census regions were manually entered into R based on a map publicly available on the U.S. Census website (5).
For my analysis I planned to conduct a simple OLS linear regression, but first conducted some basic analysis to explore the data. To begin, I needed to transform the tree canopy and COVID-19 data to the same spatial and temporal resolution. First, I calculated the tree canopy cover percentage for each county, using the `group_by()` and `summarize()` functions to average the census tract data. Next, I calculated each county's average population, average daily number of confirmed positive cases, and average daily reported deaths, also using `group_by()` and `summarize()`. Finally, I calculated a rate of deaths per confirmed case and per capita for each county. Once this was completed, I joined all datasets on the county Federal Information Processing Standards (FIPS) codes to create a dataframe including tree canopy, COVID-19, and income data as well as geometries for each county.
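The aggregation and join described above can be sketched in a few lines. This is only an illustration written in Python with the standard library, not the R code used for the analysis. The records, the helper `mean_by_key`, and the choice of a simple unweighted mean across tracts are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tract-level records: (county FIPS, canopy cover %).
tracts = [
    ("06001", 10.0),
    ("06001", 14.0),
    ("06003", 40.0),
]

# Hypothetical daily county records: (county FIPS, confirmed cases, deaths).
daily = [
    ("06001", 200, 2),
    ("06001", 300, 3),
    ("06003", 10, 1),
    ("06003", 30, 0),
]


def mean_by_key(rows, value_index):
    """Group rows by their first field and average the chosen column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[value_index])
    return {key: mean(values) for key, values in groups.items()}


# County-level averages (assumption: unweighted mean across tracts and days).
canopy = mean_by_key(tracts, 1)
avg_cases = mean_by_key(daily, 1)
avg_deaths = mean_by_key(daily, 2)

# Join on FIPS code and compute average daily deaths per average daily case.
counties = {
    fips: {
        "canopy_pct": canopy[fips],
        "deaths_per_case": avg_deaths[fips] / avg_cases[fips],
    }
    for fips in canopy.keys() & avg_cases.keys()
}

for fips in sorted(counties):
    print(fips, counties[fips])
```

Joining on the intersection of FIPS codes means a county missing from either dataset is silently dropped, which is one way a county could fall out of the final dataframe.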
Before deciding to use a simple OLS linear regression, I wanted to conduct some basic data visualization to explore the correlation between tree canopy and the rate of COVID-19 deaths per positive case.
First, I aggregated the county data further into the 10 regions defined by the U.S. Census and plotted the canopy cover percentages and COVID-19 deaths per capita for each region (Fig 1). This exploration showed a potential relationship between lower canopy cover within California regions and COVID-19 deaths per capita.
Next, I focused my exploration more closely on my research question: what impact does tree canopy coverage have on the likelihood that someone who contracts COVID-19 will die? To do this I created two maps (Fig 2). One map shows the average tree canopy cover percentage for each county within California. The other displays the ratio of average daily COVID-19 deaths to average daily confirmed positive cases for each county. As with my previous exploration, this visualization does not show anything conclusively, but it does suggest that there may be a relationship between lower tree canopy cover and a higher likelihood of death for individuals who contract COVID-19.
I conducted a simple OLS linear regression to look at the impact of tree canopy cover percentage on the rate of COVID-19 deaths per positive case. First, I used `ggplot` to create a scatter plot comparing the tree canopy cover percentage to the rate of COVID-19 deaths per positive case for each of the 57 California counties. Then I used `geom_smooth()` to plot a simple OLS linear regression of the data. Visually, it appears there is a negative correlation between the two rates (Fig 3).
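For readers who want to see what the fitted line represents, here is a minimal sketch of a one-predictor OLS fit using the closed-form formulas. It is written in Python for illustration rather than the R used in this analysis, and the data points and the function name `ols_fit` are made up for the example.

```python
def ols_fit(x, y):
    """Return (intercept, slope, r_squared) for a one-predictor OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared


# Hypothetical county values: canopy cover (%) and deaths per positive case.
canopy_pct = [5, 10, 20, 30, 40]
deaths_per_case = [0.014, 0.012, 0.013, 0.010, 0.011]

intercept, slope, r_squared = ols_fit(canopy_pct, deaths_per_case)
print(intercept, slope, r_squared)
```

One caveat on the plotting side: in ggplot2, `geom_smooth()` draws a straight OLS line only when it is called with `method = "lm"`. With its default settings it fits a smoothed curve instead.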
The results of the OLS regression are shown in Fig 4. They show a negative correlation between tree canopy cover percentage and the rate of COVID-19 deaths per positive case within California counties. The slope indicates that for each 1 percentage point increase in tree canopy cover there is a decrease of about 0.008 percentage points in the COVID-19 death rate per positive case. However, the p-value is 0.11, which is above the conventional 0.05 threshold, so the relationship is not statistically significant. Additionally, the R-squared value is 0.05, meaning that only 5% of the variation in the rate of COVID-19 deaths per positive case is explained by average tree canopy cover percentage.
Intercept | Slope | P-value | R-Squared |
---|---|---|---|
0.0130365 | -0.0000772 | 0.107893 | 0.0463171 |
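To make the table easier to read, the short sketch below plugs the reported intercept and slope into the fitted line and converts the slope into percentage points. It is written in Python purely for illustration. The function name `predicted_rate` and the 10% canopy example are my own choices, and given the weak fit the predictions should not be read as reliable estimates.

```python
# Coefficients reported in the table above.
intercept = 0.0130365
slope = -0.0000772


def predicted_rate(canopy_pct):
    """Fitted deaths per positive case at a given canopy cover (%)."""
    return intercept + slope * canopy_pct


# The slope is in deaths per case per 1 point of canopy cover.
# Multiplying by 100 expresses it in percentage points of the death rate.
slope_pct_points = slope * 100

print(f"Change per 1 point of canopy: {slope_pct_points:.5f} percentage points")
print(f"Fitted rate at 0% canopy:  {predicted_rate(0):.7f}")
print(f"Fitted rate at 10% canopy: {predicted_rate(10):.7f}")
```

In other words, the fitted line starts at roughly 1.3 deaths per 100 confirmed cases with no canopy and falls by a little under 0.008 percentage points for each additional point of canopy cover.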
Because of the lack of statistical significance discussed above, I am unable to reject the null hypothesis with the current analysis.
Null hypothesis: In California counties, the tree canopy cover percentage has no impact on the rate of COVID-19 deaths per positive reported case.
Alternative hypothesis: In California counties, the tree canopy cover percentage has an impact on the rate of COVID-19 deaths per positive reported case.
I calculated a confidence interval and found that I was 95% confident that the true change in COVID-19 deaths per positive case for each 1 percentage point increase in tree canopy cover was within the range of (-0.000172, 0.000017). Because this interval includes zero, it is consistent with the lack of statistical significance.
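As a rough consistency check, the interval can be approximately reconstructed from the reported slope and R-squared. The sketch below is in Python for illustration, not the R used in the analysis. It relies on the identity t² = R²(n − 2) / (1 − R²), which holds for a regression with a single predictor. It assumes n = 57 counties and uses 2.004 as an approximate looked-up critical value of the t distribution with 55 degrees of freedom.

```python
import math

# Values reported earlier in this post.
slope = -0.0000772
r_squared = 0.0463171
n = 57  # number of counties in the regression

# Assumption: approximate 97.5th percentile of the t distribution, 55 df.
T_CRIT = 2.004

df = n - 2
# For one-predictor OLS, the slope's t statistic follows from R-squared.
t_stat = math.sqrt(r_squared * df / (1 - r_squared))
std_error = abs(slope) / t_stat

lower = slope - T_CRIT * std_error
upper = slope + T_CRIT * std_error

print(f"t statistic (absolute value): {t_stat:.3f}")
print(f"95% CI for the slope: ({lower:.6f}, {upper:.6f})")
```

The reconstructed bounds agree with the interval reported above to six decimal places, which suggests the slope, R-squared, and confidence interval in this post are mutually consistent.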
As discussed above, there is no statistically significant conclusion to be taken from this analysis. However, I believe that future analysis is warranted.
The analysis that I conducted only included 57 observations, one for each California county included in the original tree data. This is a low number of observations, so there may be a different result if analysis were conducted looking at all 3,006 counties within the United States.
Another avenue for future analysis would be to include median income in the regression model. Because those with higher median incomes generally live in communities with higher tree canopy cover percentages (7), it is possible that the negative correlation I saw was driven more by income than by tree canopy. For future analysis I would use median income data from the United States Department of Agriculture (USDA) Economic Research Service website (4), which provides 2019 median income data aggregated to the county level.
Lastly, the tree canopy data used in this analysis was from 2011. While tree canopy cover tends to change slowly, there have been technological advances in the last decade (such as LiDAR) that allow for more accurate tree canopy cover percentage estimates.