Data Science Institute, NUI Galway, Galway, Ireland.
Insight Centre for Data Analytics, NUI Galway, Galway, Ireland.
AMIA Annu Symp Proc. 2023 Apr 29;2022:1062-1071. eCollection 2022.
Early-stage lung cancer is clinically important because of its insidious nature and rapid progression. Most models designed to predict tumor recurrence in early-stage lung cancer rely on the clinical or medical history of the patient. However, their performance could likely be improved if the input patient data contained genomic information. Unfortunately, such data is not always collected. This is the main motivation of our work, in which we imputed a specific type of genomic data and integrated it with clinical data to increase the accuracy of machine learning models for predicting relapse in early-stage, non-small cell lung cancer patients. We imputed aneuploidy scores from a publicly available TCGA lung adenocarcinoma cohort of 501 patients into similar records in the Spanish Lung Cancer Group (SLCG) data, specifically a cohort of 1348 early-stage patients. First, tumor recurrence in those patients was predicted without the imputed aneuploidy scores. Then, the SLCG data were enriched with the aneuploidy scores imputed from TCGA. This integrative approach improved the prediction of relapse risk, achieving an area under the precision-recall curve (PR-AUC) of 0.74 and an area under the ROC curve (ROC-AUC) of 0.79. Using the prediction explanation method SHAP (SHapley Additive exPlanations), we further explained the predictions made by the machine learning model. We conclude that our explainable predictive model is a promising tool for oncologists that addresses an unmet clinical need for post-treatment patient stratification based on relapse risk, while also improving predictive power by incorporating proxy genomic data that is not available for the specific patients themselves.
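The abstract does not state how "similar records" were matched or how the scores were computed, so the following is only a minimal sketch under stated assumptions: each target record receives the mean aneuploidy score of its k nearest donor records on shared, pre-scaled clinical features, and the two reported metrics are computed from scratch. The function names, the choice of k-nearest-neighbour matching, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def impute_from_donors(target_rows, donor_rows, donor_scores, k=2):
    """Give each target record the mean score of its k nearest donor records.

    Assumption: rows hold the same clinical features in the same order,
    already scaled to comparable ranges (Euclidean distance is used).
    """
    imputed = []
    for target in target_rows:
        ranked = sorted(
            range(len(donor_rows)),
            key=lambda i: math.dist(target, donor_rows[i]),
        )
        nearest = ranked[:k]
        imputed.append(sum(donor_scores[i] for i in nearest) / len(nearest))
    return imputed


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (no tie handling)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


# Hypothetical toy data: two scaled clinical features per record.
donor_rows = [(0.2, 0.0), (0.8, 1.0), (0.5, 0.5), (0.9, 0.9)]
donor_scores = [4.0, 22.0, 10.0, 18.0]  # made-up aneuploidy scores
target_rows = [(0.25, 0.1), (0.85, 0.95)]

imputed = impute_from_donors(target_rows, donor_rows, donor_scores, k=2)
print(imputed)

# Made-up relapse labels and model risk scores, for the metrics only.
labels = [1, 0, 1, 0]
risk = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, risk), average_precision(labels, risk))
```

In practice the imputed score would be appended to the clinical feature matrix before training, and the metrics would be evaluated on held-out patients with and without that extra column to measure the gain the abstract describes.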