Robert B. Willumstad School of Business, Adelphi University, Garden City, NY 11530, USA.
McDonough School of Business, Georgetown University, Washington, DC 20057, USA.
Sensors (Basel). 2022 Sep 8;22(18):6783. doi: 10.3390/s22186783.
Although lung cancer survival status and survival length have mostly been predicted in separate studies, a scheme that combines both tasks in a way physicians can interpret remains elusive. We propose a two-phase data analytic framework that classifies survival status at 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time points (phase I) and predicts the number of survival months within 3 years (phase II), using recent Surveillance, Epidemiology, and End Results (SEER) data from 2010 to 2017. We employ three analytical models (generalized linear model (GLM), extreme gradient boosting, and artificial neural networks), five data balancing techniques (synthetic minority oversampling technique (SMOTE), relocating safe level SMOTE, borderline SMOTE, adaptive synthetic sampling, and majority weighted minority oversampling technique), two feature selection methods (least absolute shrinkage and selection operator (LASSO) and random forest), and one-hot encoding. By implementing a comprehensive data preparation phase, we demonstrate that a computationally efficient and interpretable method such as GLM performs comparably to more complex models. Moreover, we quantify the effects of individual features in phases I and II using the GLM coefficients. To the best of our knowledge, this study is the first to (a) implement a comprehensive data processing approach to develop performant, computationally efficient, and interpretable methods in comparison to black-box models, (b) visualize the top factors impacting survival odds by utilizing the change in odds ratio, and (c) comprehensively explore short-term lung cancer survival using a two-phase approach.
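As a rough illustration of how GLM coefficients can be read as changes in survival odds, the sketch below converts log-odds coefficients into odds ratios and percentage changes in odds, then ranks features by effect size. It assumes a logistic (binomial) GLM; the feature names, coefficient values, and helper functions are hypothetical and are not taken from the study.

```python
import math

# Hypothetical coefficients of a fitted logistic GLM (log-odds scale).
# Names and values are illustrative only, not results from the study.
coefficients = {
    "stage_distant": -1.2,
    "surgery_performed": 0.8,
    "age_per_decade": -0.3,
}


def odds_ratio(beta: float) -> float:
    """Odds ratio implied by a one-unit increase in the feature."""
    return math.exp(beta)


def percent_change_in_odds(beta: float) -> float:
    """Percentage change in the odds for a one-unit increase in the feature."""
    return (math.exp(beta) - 1.0) * 100.0


def rank_by_effect(coefs: dict) -> list:
    """Feature names sorted by absolute coefficient, largest effect first."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)


for name in rank_by_effect(coefficients):
    beta = coefficients[name]
    print(
        f"{name}: OR={odds_ratio(beta):.2f}, "
        f"change in odds={percent_change_in_odds(beta):+.1f}%"
    )
```

A negative coefficient gives an odds ratio below 1 (lower odds of survival at the given time point), and a positive coefficient gives an odds ratio above 1. Ranking by absolute coefficient is only meaningful when features are on comparable scales, as they are after one-hot encoding of categorical variables.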