College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China.
Key Laboratory for Biomedical Engineering, Ministry of Education, Hangzhou, China.
BMC Med Inform Decis Mak. 2022 Sep 19;22(1):245. doi: 10.1186/s12911-022-01960-0.
Lung cancer is the leading cause of cancer death worldwide. Prognostic prediction plays a vital role in the decision-making process for postoperative non-small cell lung cancer (NSCLC) patients. However, the high imbalance ratio of prognostic data limits the development of effective prognostic prediction models.
In this study, we present a novel approach, ensemble learning with active sampling (ELAS), to tackle the imbalanced data problem in NSCLC prognostic prediction. ELAS first applies an active sampling mechanism that queries the most informative samples and uses them to update the base classifier, giving it a new perspective on the data. This training process is repeated until not enough samples are queried. Next, an internal validation set is used to evaluate the base classifiers, and the best-performing ones are integrated into the ensemble model. In addition, we set up multiple initial training data seeds and internal validation sets to ensure the stability and generalization of the model.
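The loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the base classifier, the informativeness criterion, the stopping threshold, or the selection metric, so the nearest-centroid classifier, the uncertainty margin, and the parameters `batch`, `margin`, `min_query` and `top_k` below are all assumptions chosen only to keep the example small and self-contained.

```python
import math
import random

def fit_centroids(X, y):
    """Assumed base classifier: one mean vector per class (needs both classes)."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def score(cents, x):
    """Positive-class score in (0, 1); 0.5 means maximally uncertain."""
    d0 = math.dist(x, cents[0])
    d1 = math.dist(x, cents[1])
    return 1.0 / (1.0 + math.exp(d1 - d0))

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def elas_sketch(X_pool, y_pool, X_val, y_val, seed_idx,
                batch=10, margin=0.15, min_query=3, top_k=3):
    """Hypothetical ELAS-style loop; all thresholds are illustrative."""
    train = set(seed_idx)
    pool = set(range(len(X_pool))) - train
    models = []
    while True:
        idx = sorted(train)
        model = fit_centroids([X_pool[i] for i in idx],
                              [y_pool[i] for i in idx])
        models.append(model)
        # Active sampling (assumed criterion): query the pool samples
        # closest to the decision boundary, at most `batch` per round.
        uncertain = sorted((abs(score(model, X_pool[i]) - 0.5), i) for i in pool)
        queried = [i for u, i in uncertain[:batch] if u < margin]
        if len(queried) < min_query:  # stop: not enough samples queried
            break
        train.update(queried)
        pool.difference_update(queried)
    # Internal validation: keep the top_k base classifiers by AUROC.
    ranked = sorted(
        models,
        key=lambda m: auroc([score(m, x) for x in X_val], y_val),
        reverse=True,
    )
    best = ranked[:top_k]
    # Ensemble prediction: average the scores of the selected classifiers.
    return lambda x: sum(score(m, x) for m in best) / len(best)

# Synthetic, imbalanced toy data (roughly 10% positives); not clinical data.
random.seed(0)

def make_data(n_neg, n_pos):
    X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n_neg)]
    X += [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(n_pos)]
    return X, [0] * n_neg + [1] * n_pos

X_pool, y_pool = make_data(180, 20)
X_val, y_val = make_data(90, 10)
seed_idx = [0, 1, 2, 180, 181, 182]  # small seed containing both classes
predict = elas_sketch(X_pool, y_pool, X_val, y_val, seed_idx)
val_auc = auroc([predict(x) for x in X_val], y_val)
```

Here `val_auc` is computed on the same internal validation set that selected the base classifiers, so it is optimistic; a separate held-out test set would be needed for an honest estimate. Repeating the call with several different `seed_idx` choices and validation splits, then pooling the selected classifiers, would mirror the multiple seeds and validation sets mentioned in the abstract.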
We verified the effectiveness of ELAS on a real clinical dataset containing 1848 postoperative NSCLC patients. Experimental results showed that ELAS achieved the best average AUROC (0.736) and AUPRC (0.453) across 6 prognostic tasks and obtained significant improvements over SVM, AdaBoost, Bagging, SMOTE and TomekLinks.
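For readers less familiar with AUPRC, the sketch below shows one common way to estimate it (average precision) and to average it across tasks. It is an illustration under assumptions: the abstract does not say which AUPRC estimator or averaging scheme was used, and the task names and scores are made-up toy values, not study data. Note that a random classifier's AUPRC is approximately the positive-class prevalence, so on highly imbalanced data an AUPRC well below 1 can still be far above chance.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision@rank taken at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / hits

# Hypothetical tasks with toy predictions (names and values are illustrative).
tasks = {
    "task_a": ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
               [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]),
    "task_b": ([0.9, 0.8, 0.7, 0.6],
               [0, 1, 1, 0]),
}

per_task = {name: average_precision(s, y) for name, (s, y) in tasks.items()}
macro_auprc = sum(per_task.values()) / len(per_task)  # unweighted mean over tasks
```

In `task_a` the two positives sit at ranks 1 and 4, giving (1/1 + 2/4) / 2 = 0.75, against a prevalence of 0.2.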
We conclude that ELAS can effectively alleviate the imbalanced data problem in NSCLC prognostic prediction and shows good potential for future use in postoperative NSCLC prognostic prediction.