Abstract
COVID-19 has been at the forefront of global concern since its emergence in December
2019. Determining the social factors that drive case incidence is paramount to mitigating
disease spread. Multiple regression is poorly suited to predicting COVID-19 case rate from
sociodemographic factors: many of these factors are collinear, and the technique produces
models that overfit the data and therefore generalize poorly to new data. As such, biased
estimation through elastic
net regression was used to conduct a broad-based analysis across the ten HHS health regions
for both the pre-Delta (March 22, 2020 to June 15, 2021) and Delta (June 15, 2021 to
November 1, 2021) waves of the COVID-19 pandemic. Statistically, elastic net proved to be
much more accurate in its predictions than multiple regression, as almost every HHS regional
model had a lower root mean square error (RMSE). A comparison of training and testing R²
values further indicated that these models remedied the overfitting. From an
epidemiological standpoint, this research confirmed many of the
known trends in terms of social factors that influence case incidence (such as group quarters
percentage or mobile home percentage per county), while also discovering interesting trends
occurring across different waves of the pandemic that give insight into the effect of measures
such as vaccination. This research provides a novel approach to modeling sociodemographic
risk factors against COVID-19 case rate, one that can readily be extended in the future
with a more robust set of sociodemographic factors.
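
As a minimal illustration of the evaluation summarized above (not the authors' code or data), the sketch below fits an elastic net by coordinate descent to synthetic data with two nearly collinear predictors, then reports the testing RMSE and the training/testing R². The synthetic features, the penalty settings `alpha` and `l1_ratio`, and all function names are assumptions chosen for illustration only.

```python
import random

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n) ** 0.5

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma (the L1 part of the penalty)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def predict(X, intercept, beta):
    return [intercept + sum(x * b for x, b in zip(row, beta)) for row in X]

def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate descent on
    (1/2n)||y - b0 - Xb||^2 + alpha*l1_ratio*||b||_1
                            + 0.5*alpha*(1 - l1_ratio)*||b||_2^2
    (alpha and l1_ratio are illustrative values, not tuned)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            fitted = predict(X, intercept, beta)
            # Partial residual: add feature j's own contribution back in.
            rho = sum(X[i][j] * (y[i] - fitted[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = (soft_threshold(rho, alpha * l1_ratio)
                       / (z + alpha * (1.0 - l1_ratio)))
        # The intercept is left unpenalized.
        no_intercept = predict(X, 0.0, beta)
        intercept = sum(yi - fi for yi, fi in zip(y, no_intercept)) / n
    return intercept, beta

# Synthetic stand-in data (assumption): x2 is nearly collinear with x1.
random.seed(0)

def make_row():
    x1 = random.gauss(0, 1)
    x2 = x1 + random.gauss(0, 0.05)
    x3 = random.gauss(0, 1)
    target = 2.0 * x1 + x3 + random.gauss(0, 0.5)
    return [x1, x2, x3], target

rows = [make_row() for _ in range(120)]
X = [features for features, _ in rows]
y = [target for _, target in rows]
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

intercept, beta = fit_elastic_net(X_train, y_train)
train_r2 = r2(y_train, predict(X_train, intercept, beta))
test_r2 = r2(y_test, predict(X_test, intercept, beta))
test_rmse = rmse(y_test, predict(X_test, intercept, beta))
print("coefficients:", beta)
print("train R2:", train_r2, "test R2:", test_r2, "test RMSE:", test_rmse)
```

A training R² that sits far above the testing R² would suggest overfitting, whereas similar values suggest that the penalized model generalizes to held-out data.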