Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA.
Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, 78249, USA.
BMC Med Genomics. 2020 Apr 3;13(Suppl 5):44. doi: 10.1186/s12920-020-0677-2.
BACKGROUND: Precise prediction of cancer types is vital for cancer diagnosis and therapy, and a predictive model also allows important cancer marker genes to be inferred. Several studies have attempted to build machine learning models for this task; however, none has taken into consideration the effects of tissue of origin, which can bias the identification of cancer markers.

RESULTS: In this paper, we introduce several Convolutional Neural Network (CNN) models that take unstructured gene expression inputs and classify tumor and non-tumor samples into their designated cancer types or as normal. Based on different designs of gene embeddings and convolution schemes, we implemented three CNN models: 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN. The models were trained and tested on gene expression profiles from a combined set of 10,340 samples of 33 cancer types and 713 matched normal tissue samples from The Cancer Genome Atlas (TCGA). Our models achieved prediction accuracies of 93.9-95.0% among 34 classes (33 cancers and normal). Furthermore, we interpreted one of the models, the 1D-CNN model, with a guided saliency technique and identified a total of 2090 cancer markers (108 per class on average). We confirmed the concordance of differential expression of these markers between the cancer type they represent and the others. In breast cancer, for instance, our model identified well-known markers such as GATA3 and ESR1. Finally, we extended the 1D-CNN model to the prediction of breast cancer subtypes and achieved an average accuracy of 88.42% among 5 subtypes. The code is available at https://github.com/chenlabgccri/CancerTypePrediction.

CONCLUSIONS: Here we present novel CNN designs for accurate and simultaneous cancer/normal and cancer type prediction based on gene expression profiles, and a unique model interpretation scheme to elucidate the biological relevance of cancer marker genes after eliminating the effects of tissue of origin. The proposed model has few hyperparameters to tune and thus can be easily adapted to facilitate cancer diagnosis in the future.
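To make the 1D-CNN idea concrete, the following is a minimal sketch of a forward pass that slides a few convolution filters over an unstructured gene expression vector and maps the pooled features to 34 class probabilities. It is an illustration only, not the authors' implementation: the toy gene count, kernel width, stride, filter count, pooling size, random untrained weights, and the log2-style input transform are all assumptions chosen for readability, and the names `conv1d_relu`, `max_pool`, `init_model`, and `predict` are hypothetical. Only the 34-class output (33 cancers and normal) comes from the abstract.

```python
import math
import random

N_GENES = 200    # assumed toy size; real expression profiles are far larger
N_CLASSES = 34   # 33 cancer types plus normal, as stated in the abstract
KERNEL = 10      # assumed kernel width
STRIDE = 5       # assumed stride
N_FILTERS = 4    # assumed number of convolution filters
POOL = 2         # assumed max-pooling window


def conv1d_relu(x, kernels, biases, stride):
    """Valid 1D cross-correlation followed by ReLU; one feature map per filter."""
    width = len(kernels[0])
    maps = []
    for kernel, bias in zip(kernels, biases):
        fmap = []
        for start in range(0, len(x) - width + 1, stride):
            window = x[start:start + width]
            s = bias + sum(w * v for w, v in zip(kernel, window))
            fmap.append(max(0.0, s))
        maps.append(fmap)
    return maps


def max_pool(fmap, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(fmap[i:i + size]) for i in range(0, len(fmap) - size + 1, size)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def init_model(rng):
    """Random, untrained weights sized to match the layer dimensions."""
    n_pos = (N_GENES - KERNEL) // STRIDE + 1
    n_flat = N_FILTERS * (n_pos // POOL)

    def rand_row(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    return {
        "kernels": [rand_row(KERNEL) for _ in range(N_FILTERS)],
        "conv_bias": [0.0] * N_FILTERS,
        "dense": [rand_row(n_flat) for _ in range(N_CLASSES)],
        "dense_bias": [0.0] * N_CLASSES,
    }


def predict(model, x):
    """Convolution -> ReLU -> max pooling -> flatten -> dense -> softmax."""
    maps = conv1d_relu(x, model["kernels"], model["conv_bias"], STRIDE)
    flat = [v for fmap in maps for v in max_pool(fmap, POOL)]
    logits = [
        bias + sum(w * v for w, v in zip(row, flat))
        for row, bias in zip(model["dense"], model["dense_bias"])
    ]
    return softmax(logits)


rng = random.Random(0)
model = init_model(rng)
# Synthetic sample: log2(1 + value) of random non-negative "expression" levels.
sample = [math.log2(1.0 + rng.uniform(0.0, 100.0)) for _ in range(N_GENES)]
probs = predict(model, sample)
predicted_class = max(range(N_CLASSES), key=lambda c: probs[c])
print(len(probs), round(sum(probs), 6), predicted_class)
```

Because the weights are random, the predicted class here carries no biological meaning; the sketch only shows how a single expression vector flows through a one-dimensional convolution to a 34-way probability vector. In practice a deep learning framework would be used so that the weights can be trained and gradients with respect to the input genes can be computed for saliency analysis.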