Boueiz Adel, Xu Zhonghui, Chang Yale, Masoomi Aria, Gregory Andrew, Lutz Sharon M, Qiao Dandi, Crapo James D, Dy Jennifer G, Silverman Edwin K, Castaldi Peter J
Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States.
Pulmonary and Critical Care Division, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States.
Chronic Obstr Pulm Dis. 2022 Jul 29;9(3):349-365. doi: 10.15326/jcopdf.2021.0275.
The heterogeneous nature of chronic obstructive pulmonary disease (COPD) complicates the identification of the predictors of disease progression. We aimed to improve the prediction of disease progression in COPD by using machine learning and incorporating a rich dataset of phenotypic features.
We included 4496 smokers with available data from their enrollment and 5-year follow-up visits in the COPD Genetic Epidemiology (COPDGene) study. We constructed linear regression (LR) and supervised random forest models to predict 5-year progression in forced expiratory volume in 1 second (FEV1) from 46 baseline features. Using cross-validation, we randomly partitioned participants into training and testing samples. We also validated the results using the COPDGene 10-year follow-up visit.
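The random partition into training and testing samples can be illustrated with a minimal sketch. The abstract does not state the number of folds or the software used, so `k=5`, the seed, and the helper name `k_fold_splits` are assumptions chosen only for illustration.

```python
import random


def k_fold_splits(n_participants, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs from a random k-fold partition.

    k and seed are hypothetical defaults; the abstract does not report them.
    """
    indices = list(range(n_participants))
    random.Random(seed).shuffle(indices)          # random assignment of participants
    folds = [indices[i::k] for i in range(k)]     # k disjoint, near-equal folds
    for i, test_idx in enumerate(folds):
        # Training sample = every participant outside the held-out fold
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 4496 is the cohort size reported in the abstract
splits = list(k_fold_splits(4496, k=5, seed=0))
```

Each participant appears in exactly one testing fold, so every prediction that is scored comes from a model that did not see that participant during training.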
Predicting the change in FEV1 over time is more challenging than predicting the future absolute FEV1 level. For random forest, R-squared was 0.15 and the area under the receiver operating characteristic (ROC) curve for identifying participants in the top quartile of observed progression was 0.71 in the testing sample; the corresponding values in the 10-year validation were 0.10 and 0.70. Random forest performed slightly better than LR. Accuracy was best for Global initiative for chronic Obstructive Lung Disease (GOLD) grade 1-2 participants, and accurate prediction was harder to achieve in advanced stages of the disease. Predictive variables differed in their relative importance, and their importance also varied across the predictions by GOLD grade.
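The two performance measures reported above, R-squared and the area under the ROC curve for identifying the top quartile of observed progression, can be computed as in the following sketch. The function names, the sign convention (larger values mean faster FEV1 decline), and the toy numbers are all hypothetical; they are not COPDGene data, and the study's exact quartile cutoff rule is not given in the abstract.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def top_quartile_labels(observed):
    """Flag roughly the quarter of participants with the largest observed decline."""
    cutoff = sorted(observed)[int(0.75 * len(observed))]
    return [o >= cutoff for o in observed]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values only (hypothetical decline, arbitrary units)
observed = [10, 20, 30, 40, 50, 60, 70, 80]
predicted = [15, 10, 35, 30, 55, 80, 60, 75]

labels = top_quartile_labels(observed)
r2 = r_squared(observed, predicted)
auc = roc_auc(labels, predicted)
```

The two measures answer different questions: R-squared scores how well the continuous change is predicted, while the AUC scores only how well the model ranks the fastest progressors above everyone else. This is one reason a model can have a low R-squared yet a moderately useful AUC.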
Random forest, along with deep phenotyping, predicts FEV1 progression with reasonable accuracy, although there is significant room for improvement in future models. This prediction model facilitates the identification of smokers at increased risk for rapid disease progression. Such findings may be useful in the selection of patient populations for targeted clinical trials.