Key Laboratory of Systems Biomedicine, Shanghai Center for Systems Biomedicine, Shanghai Jiaotong University, Shanghai, 200240, China.
School of Information Technologies, University of Sydney, Sydney, NSW, 2006, Australia.
BMC Genomics. 2018 Aug 13;19(Suppl 6):565. doi: 10.1186/s12864-018-4919-z.
With advances in DNA sequencing technology, large amounts of sequencing data have been produced, providing unprecedented opportunities for advanced association studies between somatic mutations and cancer types/subtypes, which in turn contribute to more accurate somatic mutation based cancer typing (SMCT). In existing SMCT methods, however, the absence of high-level feature extraction is a major obstacle to improving classification performance.
We propose DeepCNA, an advanced convolutional neural network (CNN) based classifier that uses copy number aberrations (CNAs) and HiC data to address this issue. DeepCNA first pre-processes the CNA data by clipping, zero padding and reshaping. The processed data are then fed into a CNN classifier, which extracts high-level features for accurate classification. Experimental results on the COSMIC CNA dataset indicate that a 2D CNN using both cell lines of HiC data leads to the best performance. We further compare DeepCNA with three widely adopted classifiers and demonstrate that DeepCNA improves performance by at least 78%.
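The pre-processing steps named above (clipping, zero padding and reshaping) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `preprocess_cna`, the clipping bounds, and the choice of a square output grid are assumptions made for illustration, and the sketch does not model how HiC data informs the layout.

```python
import math
import numpy as np

def preprocess_cna(cna, clip_min=-2.0, clip_max=2.0):
    """Clip, zero-pad, and reshape a 1D CNA vector into a square 2D array.

    clip_min / clip_max and the square layout are hypothetical choices,
    not values taken from the paper.
    """
    values = np.asarray(cna, dtype=np.float32)

    # Step 1: clipping limits extreme copy-number values to a fixed range.
    clipped = np.clip(values, clip_min, clip_max)

    # Step 2: zero padding extends the vector to the next perfect square.
    n = clipped.size
    side = math.isqrt(n)
    if side * side < n:
        side += 1
    padded = np.pad(clipped, (0, side * side - n), mode="constant")

    # Step 3: reshaping produces a 2D grid that a 2D CNN can consume.
    return padded.reshape(side, side)

example = preprocess_cna([5.0, -5.0, 0.5, 1.0, 0.0])
print(example.shape)
```

In this sketch a vector of five values is clipped, padded with four zeros, and returned as a 3 x 3 array; a real pipeline would choose the clipping range and grid dimensions from the dataset.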
This paper demonstrates the advantages and potential of the proposed DeepCNA model for processing somatic point mutation based gene data, and suggests that its use may be extended to other complex genotype-phenotype association studies.