School of Pharmacy, Changsha Medical University, Changsha, 410219, People's Republic of China.
Academician Workstation, Changsha Medical University, Changsha, 410219, People's Republic of China.
Sci Rep. 2023 Sep 16;13(1):15356. doi: 10.1038/s41598-023-42465-8.
Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose tissue of origin (TOO) cannot be identified by traditional methods. CUP patients typically have a poor prognosis, but therapy targeting the original cancer tissue can significantly improve it. It is therefore critical to develop accurate computational methods to infer the cancer TOO. Although qPCR- and microarray-based methods can infer the TOO for most cancer types, their overall prediction accuracy leaves room for improvement. In this study, we propose a cross-cohort computational framework to trace the TOO of 32 cancer types based on RNA sequencing (RNA-seq). Specifically, we employed logistic regression models to select 80 genes for each cancer type, yielding a combined 1356-gene set, based on transcriptomic data from 9911 tissue samples covering the 32 cancer types with known TOO from The Cancer Genome Atlas (TCGA). The selected genes are enriched in both tissue-specific and tissue-general functions. The cross-validation accuracy of our framework reaches 97.50% across all cancer types. Furthermore, we tested the performance of our model on the TCGA metastatic dataset and the International Cancer Genome Consortium (ICGC) dataset, achieving accuracies of 91.09% and 82.67%, respectively, despite differences in experimental procedures and analysis pipelines. In conclusion, we developed an accurate and robust computational framework for identifying TOO, which holds promise for clinical applications. Our code is available at http://github.com/wangbo00129/classifybysklearn .
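The per-type gene selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how genes are ranked, so ranking by absolute one-vs-rest logistic regression coefficient is an assumption, as are the helper name `select_genes_per_class`, the synthetic data, and all parameter values. The sketch also shows why a union of per-type selections can be smaller than the number of types times the genes per type (32 × 80 = 2560 versus the reported 1356): different types may select the same gene.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


def select_genes_per_class(X, y, genes_per_class):
    """Union of top-weighted genes from one-vs-rest logistic regressions.

    Hypothetical helper: ranking by absolute coefficient is an assumption,
    not the selection rule confirmed by the paper.
    """
    selected = set()
    for label in np.unique(y):
        # Binary problem: this cancer type versus all others.
        model = LogisticRegression(max_iter=1000)
        model.fit(X, (y == label).astype(int))
        weights = np.abs(model.coef_[0])
        top = np.argsort(weights)[-genes_per_class:]
        selected.update(top.tolist())
    return sorted(selected)


# Synthetic stand-in for an expression matrix (samples x genes); not TCGA data.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_genes = 4, 30, 200
y = np.repeat(np.arange(n_classes), n_per_class)
X = rng.normal(size=(y.size, n_genes))
for c in range(n_classes):
    # Give each class 5 marker genes with shifted expression.
    X[y == c, c * 5:(c + 1) * 5] += 3.0

X = StandardScaler().fit_transform(X)
gene_idx = select_genes_per_class(X, y, genes_per_class=5)

# Classes can share genes, so the union may be smaller than classes * k.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X[:, gene_idx], y, cv=5)
print(len(gene_idx), scores.mean())
```

For brevity, this sketch scales the data and selects genes on the full matrix before cross-validating, which leaks information into the accuracy estimate. An unbiased evaluation would repeat scaling and selection inside each training fold, or rely on independent cohorts as the study does with the TCGA metastatic and ICGC datasets.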