Jones Sara, Beyers Matthew, Shukla Maulik, Xia Fangfang, Brettin Thomas, Stevens Rick, Weil M Ryan, Ranganathan Ganakammal Satishkumar
Frederick National Laboratory for Cancer Research, Cancer Data Science Initiatives, Cancer Research Technology Program, Rockville, MD, USA.
Argonne National Laboratory, Computing, Environment and Life Sciences, Lemont, IL, USA.
Cancer Inform. 2022 Dec 5;21:11769351221139491. doi: 10.1177/11769351221139491. eCollection 2022.
Cancer is one of the leading causes of death worldwide, and accurate primary tumor type prediction is critical in identifying genetic factors that can inhibit or slow tumor progression. Over the last several years, there have been efforts to categorize primary tumor types from gene expression data using machine learning and, more recently, deep learning.
In this paper, we developed four 1-dimensional (1D) Convolutional Neural Network (CNN) models to classify RNA-seq count data as one of 17 highly represented primary tumor types or as one of 32 primary tumor types regardless of imbalanced representation. We adapted the models to take as input either all Ensembl genes (60,483) or protein-coding genes only (19,758). Unlike previous work, we avoided selection bias by not filtering genes based on expression values. To train and validate the models, we downloaded RNA-seq count data expressed as FPKM-UQ from the Genomic Data Commons (GDC) for 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA), corresponding to 17 and 32 primary tumor types, respectively.
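To make the idea of a 1D CNN over a gene expression vector concrete, the following is a minimal, dependency-free sketch of a single forward pass: 1D convolution, ReLU, max pooling, and a dense softmax layer. It is an illustration only and is not the TULIP architecture; the layer sizes, the random weights, the toy gene count, and every function name are assumptions chosen for readability, and a real model would be built and trained with a deep learning framework.

```python
import math
import random


def conv1d(signal, kernel, stride=1):
    """Valid (unpadded) 1D cross-correlation of one channel with one kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(expression, kernels, dense_weights, dense_bias, pool_size=2):
    """Map one sample's expression vector to class probabilities."""
    features = []
    for kernel in kernels:
        features.extend(max_pool1d(relu(conv1d(expression, kernel)), pool_size))
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(dense_weights, dense_bias)
    ]
    return softmax(logits)


# Toy dimensions (hypothetical): the paper's inputs have 60,483 or 19,758 genes
# and 17 or 32 output classes.
N_GENES, N_CLASSES, N_KERNELS, KERNEL_SIZE, POOL = 20, 3, 2, 5, 2

rng = random.Random(0)
kernels = [[rng.uniform(-1, 1) for _ in range(KERNEL_SIZE)] for _ in range(N_KERNELS)]
n_features = N_KERNELS * ((N_GENES - KERNEL_SIZE + 1) // POOL)
dense_weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(N_CLASSES)]
dense_bias = [0.0] * N_CLASSES

sample = [rng.uniform(0, 10) for _ in range(N_GENES)]  # stand-in for FPKM-UQ values
probs = forward(sample, kernels, dense_weights, dense_bias, POOL)
predicted_class = max(range(N_CLASSES), key=probs.__getitem__)
```

With untrained random weights the predicted class is meaningless. The sketch only shows how each sample is treated as a one-channel sequence of gene values that the convolution slides across.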
All four 1D-CNN models had an overall accuracy of 94.7% to 97.6% on the test dataset. Further evaluation indicates that, for both the 17 and 32 primary tumor type settings, the models using only protein-coding genes as features were more accurate than the models using all Ensembl genes. For all models, per-type accuracy was above 80% for most primary tumor types.
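The distinction between overall accuracy and accuracy by primary tumor type can be sketched as follows. This assumes that per-type accuracy means the fraction of samples of each true type that were predicted correctly; the abstract does not spell out the definition. The function name `accuracy_report` and the toy labels are hypothetical and are not results from the paper.

```python
from collections import Counter


def accuracy_report(true_labels, predicted_labels):
    """Return (overall accuracy, {tumor type: accuracy within that true type})."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    overall = sum(correct.values()) / len(true_labels)
    per_type = {label: correct[label] / n for label, n in totals.items()}
    return overall, per_type


# Toy labels using TCGA project codes; the values are made up for illustration.
y_true = ["BRCA", "BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "KIRC", "KIRC"]
y_pred = ["BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "LUAD", "KIRC", "BRCA"]

overall, per_type = accuracy_report(y_true, y_pred)
print(f"overall: {overall:.2f}")
for tumor_type, acc in sorted(per_type.items()):
    print(f"{tumor_type}: {acc:.2f}")
```

On imbalanced data, a high overall accuracy can coexist with low accuracy for sparsely represented types, which is why the two figures are reported separately.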
We packaged all four models as a Python-based deep learning classification tool called TULIP (TUmor CLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve the accuracy for certain primary tumor types.