Kamis Arnold, Gadia Nidhi, Luo Zilin, Ng Shu Xin, Thumbar Mansi
Brandeis International Business School, Brandeis University, Waltham, MA, United States.
JMIR AI. 2024 Aug 29;3:e58455. doi: 10.2196/58455.
Lung disease is a serious public health problem in the United States. Despite declining rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) remains a substantial health burden. In this paper, we focus on COPD in the United States from 2016 to 2019.
We gathered a diverse set of non-personally identifiable information from public data sources to better understand and predict COPD rates at the core-based statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD.
We integrated non-personally identifiable information from multiple Centers for Disease Control and Prevention sources and used it to analyze COPD with several types of methods. We included cigarette smoking, a well-known contributing factor, and race/ethnicity, because health disparities among races and ethnicities in the United States are also well known. The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods.
The most accurate multiple linear regression model explained 81.1% of the variance, with a mean absolute error of 0.591 and a symmetric mean absolute percentage error of 9.666. The most accurate machine learning model explained 85.7% of the variance, with a mean absolute error of 0.456 and a symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income are the strongest predictor variables. Moderately strong predictors include education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level.
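The three accuracy measures reported above can be illustrated with a minimal sketch. The abstract does not state which formulas were used, so the definitions here are assumptions: variance explained is taken as the coefficient of determination (R²), and the symmetric mean absolute percentage error uses the common form that divides each absolute error by the mean of the absolute actual and predicted values. The function names and the small data set are hypothetical and are not taken from the study.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent.

    Assumed form: 100/n * sum(|a - p| / ((|a| + |p|) / 2)).
    Other variants exist; the study may have used a different one.
    """
    terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Treat a 0/0 term as zero error to avoid division by zero.
        terms.append(0.0 if denom == 0 else abs(a - p) / denom)
    return 100 * sum(terms) / len(terms)


def variance_explained(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical COPD prevalence values (percent) for four areas.
actual_rates = [6.0, 8.0, 10.0, 4.0]
predicted_rates = [5.0, 8.0, 11.0, 4.0]

print(mean_absolute_error(actual_rates, predicted_rates))
print(smape(actual_rates, predicted_rates))
print(variance_explained(actual_rates, predicted_rates))
```

Under these assumed definitions, a lower mean absolute error and a lower symmetric mean absolute percentage error indicate a better fit, and a higher variance explained indicates a better fit. This matches the direction of the comparison between the two models reported above.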
This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities and was more accurate than the best multiple linear regression model. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health.