Knight Gabriel M, Spencer-Bonilla Gabriela, Maahs David M, Blum Manuel R, Valencia Areli, Zuma Bongeka Z, Prahalad Priya, Sarraju Ashish, Rodriguez Fatima, Scheinker David
Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA.
Department of Medicine, Stanford University School of Medicine, Stanford, California, USA.
BMJ Open Diabetes Res Care. 2020 Nov;8(2). doi: 10.1136/bmjdrc-2020-001725.
Population-level and individual-level analyses each have strengths and limitations, as do 'black box' machine learning (ML) models and traditional, interpretable models. Diabetes mellitus (DM) is a leading cause of morbidity and mortality, and its complex sociodemographic dynamics have not been analyzed in a way that combines population-level and individual-level data with both traditional epidemiological and ML models. We analyzed complementary individual-level and county-level datasets with both regression and ML methods to study the association between sociodemographic factors and DM.
County-level DM prevalence, demographics, and socioeconomic status (SES) factors were extracted from the 2018 Robert Wood Johnson Foundation County Health Rankings and merged with US Census data. Analogous individual-level data were extracted from the 2007 to 2016 National Health and Nutrition Examination Survey studies and corrected for oversampling with survey weights. For county data, we used multivariate linear regression and ML regression models; for individual data, we used multivariate logistic regression and ML classification models. Regression and ML models were compared using R² (a measure of explained variation) for county-level data and the area under the receiver operating characteristic curve (AUC) for individual-level data.
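The two comparison metrics named above can be sketched in a few lines. This is a minimal illustration using only the standard definitions of R² and AUC, not the authors' code; the function names `r_squared` and `auc` and the toy inputs are assumptions made for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a win."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values (hypothetical, not study data).
county_r2 = r_squared([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
individual_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this framing, R² suits the county-level models because their outcome (prevalence) is continuous, while AUC suits the individual-level models because their outcome (DM status) is binary.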
Among the 3138 counties assessed, the mean DM prevalence was 11.4% (range: 3.0%-21.1%). Among the 12 824 individuals assessed, 1688 met DM criteria (13.2% unweighted; 10.2% weighted). Age, gender, race/ethnicity, income, and education were associated with DM at the county and individual levels. Higher county Hispanic ethnic density was negatively associated with county DM prevalence, while Hispanic ethnicity was positively associated with individual DM. ML outperformed regression in both datasets (mean R² of 0.679 vs 0.610, respectively (p<0.001) for county-level data; mean AUC of 0.737 vs 0.727 (p<0.0427) for individual-level data).
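The gap between the unweighted and weighted prevalence figures reflects the survey-weight correction for oversampling. The sketch below shows the arithmetic under simple assumptions: the function name `prevalence` and the six-person toy sample with its weights are hypothetical, and only the counts 1688 and 12 824 come from the text.

```python
def prevalence(has_dm, weights=None):
    """Share of cases with DM; each case counts in proportion to its weight.
    With no weights given, every case counts equally (unweighted)."""
    if weights is None:
        weights = [1.0] * len(has_dm)
    return sum(w * d for w, d in zip(weights, has_dm)) / sum(weights)


# Unweighted prevalence from the reported counts: 1688 of 12 824 individuals.
reported_unweighted_pct = round(100 * 1688 / 12824, 1)  # 13.2

# Toy sample (hypothetical): the first four people come from an oversampled
# group with higher DM prevalence; the last two each stand in for four people
# in the target population.
toy_has_dm = [1, 1, 0, 0, 0, 0]
toy_weights = [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]

toy_unweighted = prevalence(toy_has_dm)             # 2 / 6
toy_weighted = prevalence(toy_has_dm, toy_weights)  # 2 / 12
```

When a higher-prevalence group is oversampled, as in the toy sample, the weighted estimate falls below the unweighted one, which is the same direction as the reported 13.2% versus 10.2%.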
Hispanic individuals are at higher risk of DM, while counties with larger Hispanic populations have lower DM prevalence. Analyses of population-level and individual-level data with multiple methods may afford more confidence in results and identify areas for further study.