Xie Chenyi, Du Richard, Ho Joshua WK, Pang Herbert H, Chiu Keith WH, Lee Elaine YP, Vardhanabhuti Varut
Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Queen Mary Hospital, Hong Kong SAR, China.
School of Biomedical Science, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
Eur J Nucl Med Mol Imaging. 2020 Nov;47(12):2826-2835. doi: 10.1007/s00259-020-04756-4. Epub 2020 Apr 6.
Biomedical data are frequently class-imbalanced, which makes good predictive performance hard to achieve with data-driven machine learning approaches. In this study, we investigated the impact of re-sampling techniques for imbalanced datasets on PET radiomics-based prognostication models in patients with head and neck cancer (HNC).
Radiomics analysis was performed in two cohorts: 166 patients newly diagnosed with nasopharyngeal carcinoma (NPC) at our centre and 182 HNC patients from an open database. Conventional PET parameters and robust radiomics features were extracted for correlation analysis with overall survival (OS) and disease progression-free survival (DFS). We investigated a cross-combination of 10 re-sampling methods (oversampling, undersampling, and hybrid sampling) with 4 machine learning classifiers for survival prediction. Diagnostic performance was assessed in hold-out test sets. Statistical differences were analysed across Monte Carlo cross-validation runs using the post hoc Nemenyi test.
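To make the re-sampling step concrete, the following is a minimal sketch, not the authors' implementation. It interpolates between minority-class samples and their nearest minority neighbours (the core idea behind SMOTE) and pairs this with repeated random train/test splits in the spirit of Monte Carlo cross-validation. The function names, the binary-label assumption, the toy data, and the parameters `k`, `n_repeats`, and `test_fraction` are all illustrative assumptions.

```python
import random


def smote_like_oversample(X, y, minority_label, k=3, seed=0):
    """Balance a binary dataset by adding synthetic minority samples.

    Each synthetic sample lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Illustrative only: assumes exactly two classes and numeric features.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    if len(minority) < 2 or n_needed <= 0:
        return list(X), list(y)  # nothing to interpolate, or already balanced

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return list(X) + synthetic, list(y) + [minority_label] * n_needed


def monte_carlo_splits(n_samples, n_repeats=5, test_fraction=0.3, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


# Toy data (hypothetical): 40 samples, 5 of them in the minority class.
X = [[float(i), float(i % 5)] for i in range(40)]
y = [1 if i % 8 == 0 else 0 for i in range(40)]

for train_idx, test_idx in monte_carlo_splits(len(X)):
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Re-sample the training split only; the test split keeps its
    # original class distribution so the evaluation is not distorted.
    X_bal, y_bal = smote_like_oversample(X_train, y_train, minority_label=1)
```

Re-sampling only the training split is a deliberate choice in this sketch: synthetic samples in the test split would leak information and inflate the measured performance.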
Oversampling techniques such as ADASYN and SMOTE could improve prediction performance in terms of G-mean and the F-measure of the minority class, without significant loss in the F-measure of the majority class. We identified an optimal PET radiomics-based prediction model of OS (AUC of 0.82, G-mean of 0.77) for our NPC cohort. Oversampling techniques similarly improved prediction performance when tested on an external dataset, indicating generalisability.
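The metrics named above can be illustrated with a short sketch. It assumes the common definitions, which the abstract does not spell out: G-mean as the geometric mean of sensitivity and specificity, and the F-measure as the F1 score computed separately with each class treated as the positive class. The function name and the toy labels are hypothetical.

```python
import math


def imbalance_metrics(y_true, y_pred, minority_label=1):
    """Return accuracy, G-mean, and per-class F1 for a binary problem.

    Assumed definitions (not confirmed by the abstract):
      G-mean = sqrt(sensitivity * specificity)
      F1     = 2*TP / (2*TP + FP + FN), computed once per class
    """
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == minority_label:
            if pred == minority_label:
                tp += 1
            else:
                fn += 1
        elif pred == minority_label:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the minority class
    specificity = ratio(tn, tn + fp)  # recall on the majority class
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_minority": ratio(2 * tp, 2 * tp + fp + fn),
        # With the majority class as "positive", TN plays the role of TP.
        "f_majority": ratio(2 * tn, 2 * tn + fn + fp),
    }


# Toy case (hypothetical): 90 majority and 10 minority samples, with a
# classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
scores = imbalance_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.90 and the majority-class F1 is about 0.95, yet G-mean and the minority-class F1 are both 0, which is why these measures are more informative than accuracy on imbalanced data.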
Our study showed that applying re-sampling techniques had a significant positive impact on prediction performance in imbalanced datasets. We have created an open-source solution for automated calculation and comparison of multiple re-sampling techniques and machine learning classifiers, to allow easy replication in future studies.