SPATIAL ANALYSIS OF COUNTY-LEVEL DIABETES PREVALENCE IN THE USA: A MACHINE LEARNING APPROACH

Edmund Twumasi Ampofo

Diabetes poses a major public health challenge in the United States, ranking among the ten leading causes of death. Its prevalence is closely tied to factors such as obesity and lifestyle behaviors, yet it varies significantly across geographic regions. Global regression models often fail to capture the full relationship between dependent and independent variables, especially when spatial heterogeneity is present. To understand county-level diabetes prevalence and its associated risk factors, researchers have employed spatial linear regression models, which have limitations, including the assumption of linear relationships and inadequate handling of multicollinearity. To address these limitations, this study employs a geographically weighted (GW) random forest (RF) model, which combines random forests and locally weighted regressions via a spatial weights matrix, as an exploratory and predictive tool. County-level diabetes prevalence data for the USA from 2010 to 2020, along with twelve independent variables, were divided into two time periods: pre- and post-NHIS survey updates (referred to as the "historical" and "current" periods, respectively). These data were then used to explore the nature and pattern of county-level diabetes prevalence and to evaluate the performance of GW-RF against other global and spatially weighted models. We found that all geographically weighted models outperformed their non-spatial counterparts in both periods, indicating that spatial variation plays an important role in explaining county-level diabetes prevalence. Our results further indicate that the GW-RF model captures spatial heterogeneity and predicts diabetes prevalence more effectively than the other global and local models considered.
Compared to global (G) ordinary least squares (OLS) regression, RF, and GW-OLS, the GW-RF model achieved $R^2$ values higher by 3.5%, 1.1%, and 0.6% (historical) and 2.3%, 0.5%, and 0.4% (current), as well as normalized root mean square error (NRMSE) values lower by 6.1%, 2%, and 1% (historical) and 0.8%, 0.3%, and 0.2% (current), respectively. We also found that, although the models generally performed well, their performance dropped in the current period. This decline may be because diabetes prevalence showed less spatial autocorrelation in the current period (historical Moran's I: 0.559, p < 0.001; current Moran's I: 0.45, p < 0.001). This shift in the underlying spatial patterns of diabetes could reflect known changes in survey methodology or actual epidemiological changes, both of which warrant further investigation. The findings also suggest that the GW-RF model can support health professionals and policymakers in making accurate projections, detecting emerging hotspots, and guiding targeted prevention and control efforts.
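
The spatial autocorrelation statistic reported above, global Moran's I, can be illustrated with a minimal sketch. This is not the study's implementation: the function name `morans_i`, the helper `chain_weights`, and the four-unit toy layout are assumptions made only for illustration, and the example uses a simple binary adjacency matrix instead of the county-level spatial weights used in the analysis. It does not reproduce the reported values of 0.559 and 0.45.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix.

    I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2,
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]          # deviations from the mean
    s0 = sum(sum(row) for row in weights)     # total weight S0
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


def chain_weights(n):
    """Hypothetical toy layout: n units in a row, each adjacent to its neighbours."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w


W = chain_weights(4)
clustered = morans_i([1.0, 2.0, 3.0, 4.0], W)    # similar values sit next to each other
alternating = morans_i([1.0, 4.0, 1.0, 4.0], W)  # neighbours are dissimilar
```

In this toy layout the smoothly increasing values give a positive statistic and the alternating values give a negative one, which is the sense in which a lower Moran's I in the current period indicates weaker spatial clustering of prevalence.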
