A two-stage modeling approach for breast cancer survivability prediction.

Affiliations

Department of Decision Sciences, Adelphi University, Garden City, NY 11530, USA.

College of Nursing and Public Health, Adelphi University, Garden City, NY 11530, USA.

Publication information

Int J Med Inform. 2021 May;149:104438. doi: 10.1016/j.ijmedinf.2021.104438. Epub 2021 Mar 11.

Abstract

BACKGROUND

Despite the increasing number of studies on breast cancer survival prediction, little attention has been paid to deceased patients and their survival lengths. Moreover, developing a model that is both accurate and interpretable remains a challenge.

OBJECTIVE

This paper proposes a two-stage data analytic framework in which Stage I classifies patients as survived or deceased and Stage II predicts the number of survival months for deceased females with cancer. Since medical data are neither entirely clean nor ready for model development, we aim to show that data preparation can strengthen a simple Generalized Linear Model (GLM) so that it predicts as accurately as complex models such as Extreme Gradient Boosting (XGB) and the Multilayer Perceptron based on Artificial Neural Networks (MLP-ANN) in both stages.
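The control flow of such a two-stage framework can be sketched in a few lines. This is only an illustration under stated assumptions: `two_stage_predict`, `toy_classifier`, and `toy_regressor` are hypothetical names, the toy models are arbitrary stand-ins for the fitted Stage I and Stage II models, and routing on the Stage I prediction alone is an assumption about how the stages would be chained at prediction time.

```python
from typing import Callable, Optional, Sequence

Features = Sequence[float]


def two_stage_predict(
    x: Features,
    classify_deceased: Callable[[Features], bool],
    regress_months: Callable[[Features], float],
) -> Optional[float]:
    """Stage I decides the status; Stage II runs only for predicted-deceased cases."""
    if classify_deceased(x):
        return regress_months(x)  # predicted number of survival months
    return None  # predicted to survive: no survival length is estimated


# Hypothetical stand-ins for fitted models (not the paper's models).
def toy_classifier(x: Features) -> bool:
    return x[0] > 0.5


def toy_regressor(x: Features) -> float:
    return 36.0 * (1.0 - x[0])
```

Calling `two_stage_predict([0.75], toy_classifier, toy_regressor)` passes the record through both stages, while a record that Stage I labels as surviving returns `None`.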

METHODS

In Stage I, we use recent Surveillance, Epidemiology, and End Results (SEER) data from 2004 to 2016 to predict short-term survival status from 6 months to 3 years in 6-month increments. Three re-sampling techniques (Synthetic Minority Over-sampling Technique (SMOTE), Relocating Safe-Level SMOTE (RSLS), and Adaptive Synthetic (ADASYN)), two feature selection methods (Least Absolute Shrinkage and Selection Operator (LASSO) and Random Forest (RF)), and two encodings (integer and one-hot) are combined with three popular data mining methods: GLM, XGB, and MLP. In Stage II, we predict the number of survival months for patients who are correctly predicted as deceased within 3 years. Again, we employ GLM, XGB, and MLP for regression, with LASSO and RF for feature selection and one-hot encoding for the categorical features.
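Two of the data preparation steps named above, one-hot encoding and minority over-sampling, can be illustrated with a minimal standard-library sketch. It shows only the basic SMOTE interpolation idea (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbors); it does not implement RSLS or ADASYN, which refine where and how many synthetic points are generated. The function names, the value of `k`, and the toy data are assumptions made for illustration.

```python
import math
import random


def one_hot(value, levels):
    """Encode a categorical value as a 0/1 indicator vector over its levels."""
    return [1.0 if value == level else 0.0 for level in levels]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by SMOTE-style interpolation.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (hypothetical numeric features, not SEER data).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority_points, n_new=5)
```

Because each synthetic point is a convex combination of two real minority points, it always lies between them, which is why over-sampling of this kind is usually applied after categorical features have been encoded numerically.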

RESULTS

We obtain Area Under the Receiver Operating Characteristic Curve (AUC) values of 0.900, 0.898, 0.877, 0.852, 0.852, and 0.858 for the 6-month, 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively, using OneHotEncoding-GLM-LASSO-ADASYN. We use the change in the Odds Ratio values in GLM to show the impact of individual categorical levels and numerical features on the odds of death. In Stage II, we obtain a Mean Absolute Error (MAE) of 7.960 months using OneHotEncoding-GLM-LASSO when predicting the number of survival months for deceased patients. We present the top contributing features and their coefficient values to illustrate how the presence of each feature alters the predicted number of survival months.
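For readers less familiar with the three quantities reported here, the sketch below computes each one on toy numbers. It assumes the Stage I GLM is a logistic regression, in which case the odds ratio of a feature is the exponential of its coefficient; the function names and inputs are illustrative and unrelated to the study's data.

```python
import math


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic-regression coefficient."""
    return math.exp(coefficient)


def mae(actual, predicted):
    """Mean absolute error, in the units of the target (here, months)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

Under this reading, a coefficient of 0 corresponds to an odds ratio of 1 (no change in the odds of death), a positive coefficient to an odds ratio above 1, and a negative coefficient to one below 1.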

CONCLUSION

To the best of our knowledge, this is the first study that implements both breast cancer survival classification and regression in a two-stage approach. All data-driven findings are presented to help clinicians make better care decisions using GLM, an interpretable and computationally efficient method that predicts survival status and, for deceased patients, survival length, and thereby to help foster human-machine interaction.
