Muro Shigeo, Ishida Masato, Horie Yoshiharu, Takeuchi Wataru, Nakagawa Shunki, Ban Hideyuki, Nakagawa Tohru, Kitamura Tetsuhisa
Department of Respiratory Medicine, Nara Medical University, Nara, Japan.
Department of Respiratory and Immunology, Medical, AstraZeneca KK, Osaka, Japan.
JMIR Med Inform. 2021 Jul 6;9(7):e24796. doi: 10.2196/24796.
Airflow limitation is a critical physiological feature of chronic obstructive pulmonary disease (COPD), for which long-term exposure to noxious substances, including tobacco smoke, is an established risk factor. However, not all long-term smokers develop COPD, which suggests that other risk factors exist.
This study aimed to use machine learning to identify risk factors that predict a COPD diagnosis in an annual medical check-up database.
In this retrospective observational cohort study (ARTDECO [Analysis of Risk Factors to Detect COPD]), annual medical check-up records for all Hitachi Ltd employees in Japan collected from April 1998 to March 2019 were analyzed. Employees who provided informed consent via an opt-out model were screened, and those aged 30 to 75 years without a prior diagnosis of COPD/asthma or a history of cancer were included. The database included clinical measurements (eg, pulmonary function tests) and questionnaire responses. To identify risk factors predicting a COPD diagnosis within a 3-year period, a gradient-boosted decision tree method (XGBoost) was applied as the primary approach, with logistic regression as the secondary method. A diagnosis of COPD was made when the ratio of the prebronchodilator forced expiratory volume in 1 second (FEV1) to prebronchodilator forced vital capacity (FVC) was <0.7 during two consecutive examinations.
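The diagnostic criterion described above (a prebronchodilator ratio below 0.7 at two consecutive examinations) can be illustrated with a minimal sketch. The function name `first_copd_exam` and the input layout (one ratio per annual examination, in chronological order, with no missing visits) are assumptions made for illustration; the abstract does not describe how the study implemented this rule or handled skipped examinations.

```python
from typing import Optional, Sequence

# Cutoff for the prebronchodilator FEV1/FVC ratio, as stated in the abstract.
RATIO_THRESHOLD = 0.7


def first_copd_exam(
    ratios: Sequence[float], threshold: float = RATIO_THRESHOLD
) -> Optional[int]:
    """Return the index of the examination at which the criterion is first met.

    `ratios` is a hypothetical chronological list of one person's ratios,
    one per annual examination. The criterion is met at the second of two
    consecutive examinations that are both below `threshold`. Returns None
    if the criterion is never met.
    """
    for i in range(1, len(ratios)):
        if ratios[i - 1] < threshold and ratios[i] < threshold:
            return i
    return None


# Illustrative values only, not study data.
example = [0.75, 0.69, 0.72, 0.68, 0.66]
print(first_copd_exam(example))
```

In this example a single low value (0.69) followed by a normal one does not count; only the final pair (0.68, 0.66) satisfies the rule.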
Of the 26,101 individuals screened, 1213 met the exclusion criteria, and thus, 24,815 individuals were included in the analysis. The top 10 predictors for COPD diagnosis were FEV1/FVC, smoking status, allergic symptoms, cough, pack years, hemoglobin A1c, serum albumin, mean corpuscular volume, percent predicted vital capacity, and percent predicted value of FEV1. The areas under the receiver operating characteristic curves of the XGBoost model and the logistic regression model were 0.956 and 0.943, respectively.
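For readers unfamiliar with the reported metric, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy labels and scores are hypothetical; this is not the study's evaluation code, and it uses no study data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for a positive case and 0 for a negative case;
    `scores` holds the model's predicted risk for the same cases.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive case ranked above negative case
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; for a cohort of this size a rank-based implementation from a standard library would be the practical choice.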
Using a machine learning model in this longitudinal database, we identified several parameters other than smoking exposure and lung function as risk factors, which may help general practitioners and occupational health physicians predict the development of COPD. Further research is warranted to confirm our results, as our analysis used a database from Japan only.