Schöneck Mirjam, Rehbach Nicolas, Lotter-Becker Lars, Persigehl Thorsten, Lennartz Simon, Caldeira Liliana Lourenco
Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany.
Life (Basel). 2025 Jan 11;15(1):83. doi: 10.3390/life15010083.
Kirsten Rat Sarcoma viral oncogene homolog (KRAS) is a frequently occurring mutation in non-small-cell lung cancer (NSCLC) and influences cancer treatment and disease progression. In this study, a machine learning (ML) pipeline was applied to radiomic features extracted from public and internal CT images to identify KRAS mutations in NSCLC patients. Both datasets were analyzed using parametric (t-test) and non-parametric statistical tests (Mann-Whitney U test) and dimensionality reduction techniques. Afterwards, the proposed ML pipeline was applied to both datasets using a five-fold cross-validation on the training set (70/30 train/test split) before being validated on the other dataset. The results show that the radiomic features are significantly different (Mann-Whitney U test; p < 0.05) between the two datasets, despite the use of identical feature extraction methods. Model transferability is therefore difficult to achieve, which became evident during external testing (F1 score = 0.41). Oversampling, undersampling, clustering and harmonization techniques were applied to balance and harmonize the datasets, but did not improve the classification of KRAS mutation presence. In general, due to only a single moderate result (highest test F1 score = 0.67), the accuracy of KRAS prediction is not sufficient for clinical application. In future work, the complexity of KRAS mutation might be addressed by taking submutations into consideration. Larger multicentric datasets with balanced tumor stages, including multi-scanner datasets, seem to be necessary for building robust predictive models.
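The two quantities the abstract relies on, a per-feature Mann-Whitney U comparison between datasets and the F1 score, can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the function names (`mann_whitney_u`, `flag_shifted_features`, `f1_score`), the feature names, and the synthetic values are all hypothetical, and the p-value uses the normal approximation with tie correction rather than an exact test.

```python
import math
import random


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of x, two-sided p-value). No continuity correction.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    values = sorted(list(x) + list(y))
    rank_of = {}      # value -> average rank among the pooled samples
    tie_term = 0.0    # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        t = j - i
        rank_of[values[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2.0))


def flag_shifted_features(dataset_a, dataset_b, alpha=0.05):
    """Names of features whose distributions differ between two datasets.

    Each dataset is a dict mapping feature name -> list of values.
    """
    return [name for name in dataset_a
            if mann_whitney_u(dataset_a[name], dataset_b[name])[1] < alpha]


def f1_score(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive (mutated) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic stand-ins for radiomic features (assumed, not study data):
# "feature_shifted" is offset by two standard deviations in the second dataset.
random.seed(0)
public = {
    "feature_shifted": [random.gauss(0.0, 1.0) for _ in range(60)],
    "feature_stable": [float(k) for k in range(60)],
}
internal = {
    "feature_shifted": [random.gauss(2.0, 1.0) for _ in range(60)],
    "feature_stable": [float(k) for k in range(60)],
}
shifted = flag_shifted_features(public, internal)
print("features differing between datasets:", shifted)
```

In practice, a tested library routine such as `scipy.stats.mannwhitneyu` would be preferable to a hand-written test, and a correction for multiple comparisons would be advisable when many features are screened at once.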