Lu Qingfeng, Chen Fengxia, Li Qianyue, Chen Lihong, Tong Ling, Tian Geng, Zhou Xiaohong
Oncology Department, Daqing Oilfield General Hospital, Daqing, China.
Department of Thoracic Surgery, Hainan General Hospital, Haikou, China.
Front Oncol. 2022 Apr 21;12:832567. doi: 10.3389/fonc.2022.832567. eCollection 2022.
Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%-5% of all human malignancies. CUP patients are usually treated with broad-spectrum chemotherapy, which often leads to a poor prognosis. Recent studies suggest that treatment targeting the primary lesion of CUP significantly improves patient prognosis. An efficient method to accurately detect the tissue of origin of CUP is therefore urgently needed in clinical cancer research. In this work, we developed a novel framework that uses Extreme Gradient Boosting (XGBoost) to trace the primary site of CUP from microarray-based gene expression data. First, we downloaded microarray-based gene expression profiles of 59,385 genes for 5,708 samples from The Cancer Genome Atlas (TCGA) and of 6,364 genes for 3,101 samples from the Gene Expression Omnibus (GEO). Each dataset was divided into training and independent testing data at a ratio of 4:1. We then selected 200 and 290 genes from the TCGA and GEO training data, respectively, to train XGBoost models for identifying the primary site of CUP. The overall 5-fold cross-validation accuracies of our method were 96.9% and 95.3% on the TCGA and GEO training datasets, respectively. The accuracies on the independent testing datasets reached 96.75% and 98.8% on TCGA and GEO, respectively. These results demonstrate that the XGBoost framework can reduce the cost of tracing the primary site in clinical practice while remaining efficient, which may make it useful in clinical settings.
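The evaluation protocol described in the abstract (a 4:1 split into training and independent testing data, followed by 5-fold cross-validation on the training portion) can be sketched as below. This is only an illustration under stated assumptions: the abstract does not say whether the split was random or stratified by cancer type, so a plain shuffled split with a fixed seed is assumed here, and the function names `holdout_split` and `k_fold` are hypothetical. Gene selection and XGBoost training are deliberately left out.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists.

    A test_fraction of 0.2 corresponds to the 4:1 ratio in the abstract.
    The unstratified shuffle and the seed are assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    # Deal the indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 5,708 is the TCGA sample count reported in the abstract.
train_idx, test_idx = holdout_split(5708)
folds = list(k_fold(train_idx, k=5))
```

In a full pipeline, each `(train, validation)` pair would be used to fit and score a classifier, and `test_idx` would be held back until the final evaluation so that the independent accuracy is not influenced by model selection.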