Yuan Yuchen, Shi Yi, Li Changyang, Kim Jinman, Cai Weidong, Han Zeguang, Feng David Dagan
School of Information Technologies, The University of Sydney, Darlington, NSW, 2008, Australia.
Key Laboratory of Systems Biomedicine, Shanghai Center for Systems Biomedicine, Shanghai Jiaotong University, Shanghai, 200240, China.
BMC Bioinformatics. 2016 Dec 23;17(Suppl 17):476. doi: 10.1186/s12859-016-1334-9.
With advances in DNA sequencing technology, large amounts of sequencing data have become available in recent years, providing unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes. Such studies may contribute to more accurate somatic point mutation based cancer classification (SMCC). However, in existing SMCC methods, high data sparsity, small sample sizes, and the use of simple linear classifiers are major obstacles to improving classification performance.
To address these obstacles, we propose DeepGene, a deep neural network (DNN) based classifier that consists of three steps. First, clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes. Second, indexed sparsity reduction (ISR) converts the gene data into the indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity. Finally, the data after CGF and ISR are fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, a reformulated subset of the TCGA dataset containing 12 selected types of cancer, show that CGF, ISR and the DNN each contribute to improving the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene achieves an improvement of at least 24% in testing accuracy.
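The abstract does not specify how CGF and ISR are implemented, so the following is only a minimal sketch of the general idea, under stated assumptions. The function names `filter_by_frequency` and `indexed_sparsity_reduction` are hypothetical. The filter keeps the most frequently mutated genes and omits the clustering step entirely, so it is a simplified stand-in for CGF rather than the authors' procedure. The 1-based indexing, zero padding, and fixed output width in the index encoding are likewise assumptions made for illustration.

```python
from typing import List


def filter_by_frequency(samples: List[List[int]], top_k: int) -> List[int]:
    """Return the column indices of the top_k most frequently mutated genes.

    Simplified stand-in for CGF (assumption): it ranks genes by mutation
    count only and performs no clustering.
    """
    n_genes = len(samples[0])
    counts = [sum(row[g] for row in samples) for g in range(n_genes)]
    # Sort by descending count; break ties by the lower gene index.
    ranked = sorted(range(n_genes), key=lambda g: (-counts[g], g))
    return sorted(ranked[:top_k])


def indexed_sparsity_reduction(row: List[int], kept: List[int], width: int) -> List[int]:
    """Encode a binary mutation vector as the positions of its non-zero entries.

    Positions are 1-based within the filtered gene list so that 0 can serve
    as padding (assumption). The result is truncated or padded to `width`.
    """
    idx = [pos + 1 for pos, g in enumerate(kept) if row[g] != 0]
    idx = idx[:width]
    return idx + [0] * (width - len(idx))


# Toy data: 4 samples x 6 genes, where 1 marks a mutated gene.
samples = [
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
]

kept = filter_by_frequency(samples, top_k=3)
encoded = [indexed_sparsity_reduction(r, kept, width=3) for r in samples]
print(kept)
print(encoded)
```

In this toy setup, each mostly-zero row is replaced by a short, dense vector of indexes, which is the kind of input a downstream classifier would receive in place of the raw sparse matrix.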
Based on deep learning and somatic point mutation data, we devise DeepGene, a cancer type classifier that addresses the obstacles in existing SMCC studies. Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which we attribute mainly to its deep learning module and its ability to extract high-level features relating combinatorial somatic point mutations to cancer types.