Ramirez Ricardo, Chiu Yu-Chiao, Hererra Allen, Mostavi Milad, Ramirez Joshua, Chen Yidong, Huang Yufei, Jin Yu-Fang
Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, Texas 78249, USA.
Greehey Children's Cancer Research Institute, The University of Texas Health San Antonio, San Antonio, TX, 78229, USA.
Front Phys. 2020 Jun;8. doi: 10.3389/fphy.2020.00203. Epub 2020 Jun 17.
BACKGROUND: Cancer has been a leading cause of death in the United States, with significant health care costs. Accurate prediction of cancers at an early stage and an understanding of the genomic mechanisms that drive cancer development are vital to improving treatment outcomes and survival rates, with significant social and economic impact. Attempts have been made to classify cancer types with machine learning techniques over the past two decades and, more recently, with deep learning approaches.
RESULTS: In this paper, we established four graph convolutional neural network (GCNN) models that take unstructured gene expression data as input and classify tumor and non-tumor samples into one of 33 designated cancer types or as normal. The four GCNN models are based on a co-expression graph, a co-expression+singleton graph, a protein-protein interaction (PPI) graph, and a PPI+singleton graph. They were trained and tested on a combined set of 10,340 cancer samples and 731 normal tissue samples from The Cancer Genome Atlas (TCGA) dataset. The GCNN models achieved prediction accuracies of 89.9-94.7% across 34 classes (33 cancer types and a normal group). Gene-perturbation experiments were performed on all four models. The co-expression GCNN model was further interpreted to identify a total of 428 marker genes that drive the classification of the 33 cancer types and normal. The concordance of differential expression of these markers between the represented cancer type and the others was confirmed. Successful classification of cancer types and a normal group, regardless of the normal tissues' origin, suggests that the identified markers are cancer-specific rather than tissue-specific.
CONCLUSION: Novel GCNN models have been established to predict cancer types or normal tissue from gene expression profiles. Results on the TCGA dataset demonstrate that these models can produce accurate classification (up to 94.7%) using cancer-specific marker genes. The models and the source code are publicly available and can be readily adapted by the data-driven modeling research community to the diagnosis of cancer and other diseases.
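As a rough illustration of the idea in RESULTS, the sketch below builds a co-expression graph from an expression matrix and applies one graph convolution to a single sample, treating each gene as a node. It is a minimal sketch under stated assumptions, not the authors' implementation: the Pearson correlation cutoff of 0.6, the first-order (Kipf and Welling style) propagation rule, the synthetic data, and every function name are hypothetical choices made here for illustration.

```python
import numpy as np


def coexpression_adjacency(expr, threshold=0.6):
    """Binary gene-gene adjacency from an (n_samples, n_genes) matrix.

    Two genes are linked when the absolute Pearson correlation of their
    expression across samples exceeds `threshold` (a hypothetical cutoff).
    """
    corr = np.corrcoef(expr, rowvar=False)
    adj = (np.abs(corr) > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # self-loops are added during normalization
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(x, a_hat, weight):
    """One graph convolution with ReLU: max(A_hat @ X @ W, 0).

    x has shape (n_genes, in_features); the result has shape
    (n_genes, out_features).
    """
    return np.maximum(a_hat @ x @ weight, 0.0)


# Synthetic data (assumption): 50 samples, 6 genes, genes 0 and 1 co-expressed.
rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 6))
expr[:, 1] = expr[:, 0] + 0.05 * rng.normal(size=50)

adj = coexpression_adjacency(expr)
a_hat = normalize_adjacency(adj)

# One sample becomes a graph signal: one scalar feature per gene node.
sample = expr[0][:, None]
weight = rng.normal(size=(1, 4))  # untrained, random filter weights
hidden = graph_conv(sample, a_hat, weight)
print(hidden.shape)
```

In a full classifier, layers like this would be followed by pooling and a dense softmax layer over the 34 classes, with the weights learned from labeled samples. A singleton variant would presumably keep genes that have no edges as isolated nodes rather than dropping them, though the abstract does not spell this out.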