He Binsheng, Lang Jidong, Wang Bo, Liu Xiaojun, Lu Qingqing, He Jianjun, Gao Wei, Bing Pingping, Tian Geng, Yang Jialiang
Academician Workstation, Changsha Medical University, Changsha, China.
Geneis Beijing Co., Ltd., Beijing, China.
Front Bioeng Biotechnol. 2020 May 19;8:394. doi: 10.3389/fbioe.2020.00394. eCollection 2020.
Metastatic cancers require further diagnosis to determine their primary tumor sites. However, according to statistics from the United States, the tissue-of-origin of around 5% of tumors cannot be identified by routine medical diagnosis. With the development of machine learning techniques and the accumulation of large cancer datasets in The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), it is now feasible to predict cancer tissue-of-origin with computational tools. A metastatic tumor inherits characteristics from its tissue-of-origin, and both gene expression profiles and somatic mutations show tissue specificity. We therefore developed TOOme, a computational framework that infers tumor tissue-of-origin by integrating gene mutation and expression. Specifically, we first perform feature selection on both gene expression and mutation data with a random forest method. The selected features are then used to build a multi-label classification model that infers cancer tissue-of-origin. We adopt several popular multi-label classification methods and compare them by 10-fold cross-validation. We applied TOOme to TCGA data containing 7,008 non-metastatic samples across 20 solid tumors. The random forest procedure selects 74 genes from the gene expression profiles and 6 genes from the gene mutation data, which fall into two categories: (1) cancer-type-specific genes and (2) genes expressed or mutated in several cancers at different expression levels or mutation rates. Functional analysis indicates that the selected genes are significantly enriched in gland development, urogenital system development, hormone metabolic process, thyroid hormone generation, prostate hormone generation, and so on. Among the multi-label classification methods compared, random forest performs best, with a 10-fold cross-validation prediction accuracy of 96%. We also use 19 metastatic samples from TCGA and 256 cancer samples downloaded from GEO as independent testing data, on which TOOme achieves a prediction accuracy of 89%. The cross-validation accuracy is better than that obtained using gene expression alone (95%) or gene mutation alone (53%). In conclusion, TOOme provides a quick yet accurate alternative to traditional medical methods for inferring cancer tissue-of-origin. In addition, methods combining somatic mutation and gene expression outperform those using gene expression or mutation alone.
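The workflow described in the abstract (random forest feature selection, then a classifier scored by 10-fold cross-validation) can be sketched with scikit-learn. This is an illustrative sketch only, not the TOOme implementation. The data are synthetic stand-ins for expression and binary mutation features, and the function names, the number of trees, the importance threshold (scikit-learn's default, the mean importance) and the choice to select features on the concatenated matrix are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def make_toy_data(n_per_class=40, n_classes=4, n_expr=50, n_mut=20, seed=0):
    """Build synthetic expression + mutation features (hypothetical, not TCGA data)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for c in range(n_classes):
        # Continuous "expression" values with one class-specific marker gene.
        expr = rng.normal(0.0, 1.0, size=(n_per_class, n_expr))
        expr[:, c] += 5.0
        # Binary "mutation" indicators with one mutation enriched in this class.
        mut = (rng.random((n_per_class, n_mut)) < 0.05).astype(float)
        mut[:, c] = (rng.random(n_per_class) < 0.6).astype(float)
        blocks.append(np.hstack([expr, mut]))
        labels.extend([c] * n_per_class)
    return np.vstack(blocks), np.array(labels)


def cross_validated_accuracy(X, y, n_splits=10, seed=0):
    """Mean accuracy of RF feature selection + RF classifier over stratified folds."""
    # Keep features whose random forest importance exceeds the mean importance.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=seed)
    )
    classifier = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Selection sits inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(selector, classifier)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()


X, y = make_toy_data()
mean_accuracy = cross_validated_accuracy(X, y)
print(f"10-fold CV accuracy on toy data: {mean_accuracy:.2f}")
```

Placing the selector inside the pipeline keeps the held-out fold out of the feature selection step, which avoids an optimistic bias in the cross-validation estimate. Whether the original study selected features inside or outside the folds is not stated in the abstract.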