Instituto de Investigación Interdisciplinaria, Vicerrectoría Académica, Universidad de Talca, 3460000, Talca, Chile.
Centre for Biotechnology and Bioengineering (CeBiB), Department of Chemical Engineering, Biotechnology and Materials, University of Chile, 8370456, Santiago, Chile.
J Transl Med. 2022 Aug 18;20(1):373. doi: 10.1186/s12967-022-03572-8.
Recent large-scale cancer genomic studies have released mutational and clinical data for large cohorts of cancer patients. For example, the Pan-Lung Cancer 2016 dataset (part of The Cancer Genome Atlas project) summarises the mutational and clinical profiles of different subtypes of Lung Cancer (LC). Mutational and clinical signatures have been used independently for tumour typification and prediction of metastasis in LC patients. Is it then possible to achieve better typifications and predictions by combining both data streams?
In a cohort of 1144 Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LSCC) patients, we studied the number of missense mutations (hereafter, the Total Mutational Load, TML) and the distribution of clinical variables across different classes of patients. Using the TML and different sets of clinical variables (tumour stage, age, sex, smoking status, and packs of cigarettes smoked per year), we built Random Forest classification models that estimate the likelihood of developing metastasis.
We found that LC patients differing in age, smoking status, and tumour type had significantly different mean TMLs. Although TML was an informative feature, its effect was secondary to that of the "tumour stage" feature. However, its contribution to the classification is not redundant with the latter: models trained using both TML and tumour stage performed better than models trained using only one of these variables. Models trained on the entire dataset (i.e., without using dimensionality reduction techniques) and without resampling achieved the highest performance, with an F1 score of 0.64 (95% CrI [0.62, 0.66]).
Clinical variables and TML should be considered together when assessing the likelihood of LC patients progressing to metastatic states, as the information these encode is not redundant. Altogether, we provide new evidence of the need for comprehensive diagnostic tools for metastasis.
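For readers unfamiliar with the metric reported above, the F1 score is the harmonic mean of precision and recall for the positive class (here, metastasis). The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `f1_score` and the label vectors are hypothetical and chosen only for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (assumption): 1 = metastasis, 0 = no metastasis.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")  # prints "F1 = 0.75"
```

Because F1 ignores true negatives, it is often preferred over plain accuracy when the positive class is comparatively rare.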