DeepSSV: detecting somatic small variants in paired tumor and normal sequencing data with convolutional neural network.

Affiliations

Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Suzhou, Jiangsu, China.

La Trobe University, Melbourne, Victoria, Australia.

Publication information

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa272.

PMID:33164053

Abstract

Detecting somatic mutations in paired tumor and normal sequencing data is of considerable interest. A number of callers based on statistical or machine learning approaches have been developed to detect somatic small variants. However, they consider only limited information about the reference and potential variant alleles in the tumor and normal samples at a candidate somatic site. They also differ in how they address biological and technical noise, so they are expected to produce divergent outputs. To overcome these drawbacks of existing somatic callers, we developed a deep learning-based tool called DeepSSV, which employs a convolutional neural network (CNN) model to learn increasingly abstract feature representations from the raw data in higher feature layers. DeepSSV creates a spatially oriented representation of the read alignments around each candidate somatic site, adapted for the convolutional architecture, which enables it to gather scattered evidence effectively. Moreover, DeepSSV incorporates the mapping information of both reference allele-supporting and variant allele-supporting reads in the tumor and normal samples at a genomic site, which is readily available in the pileup format file. Together, these inputs allow the CNN model to process the whole alignment information. Such representational richness allows the model to capture dependencies in the sequence and to identify context-based sequencing artifacts. We fitted the model on ground truth somatic mutations and benchmarked it on simulated and real tumors. The benchmarking results demonstrate that DeepSSV outperforms its state-of-the-art competitors in overall F1 score.
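
To illustrate what a spatially oriented representation of read alignments might look like, the following is a minimal Python sketch. It is not DeepSSV's actual encoding, which the abstract does not specify. It assumes ungapped reads given as hypothetical `(start, sequence)` pairs and one-hot encodes the bases that fall within a fixed window around a candidate site, once for the tumor sample and once for the normal sample. The names `encode_reads` and `encode_site` and the parameters `flank` and `max_reads` are illustrative assumptions.

```python
BASES = "ACGT"


def encode_reads(reads, center, flank, max_reads):
    """One-hot encode reads over the window [center - flank, center + flank].

    reads: list of (start, sequence) pairs; assumed ungapped (no indels or clipping).
    Returns a nested list shaped max_reads x (2 * flank + 1) x 4.
    Cells with no read coverage, or with a base outside ACGT, stay all-zero.
    """
    width = 2 * flank + 1
    window_start = center - flank
    tensor = [[[0] * len(BASES) for _ in range(width)] for _ in range(max_reads)]
    for row, (start, seq) in enumerate(reads[:max_reads]):
        for offset, base in enumerate(seq):
            col = start + offset - window_start
            if 0 <= col < width and base in BASES:
                tensor[row][col][BASES.index(base)] = 1
    return tensor


def encode_site(tumor_reads, normal_reads, center, flank=2, max_reads=4):
    """Encode the tumor and normal samples separately at one candidate site."""
    return {
        "tumor": encode_reads(tumor_reads, center, flank, max_reads),
        "normal": encode_reads(normal_reads, center, flank, max_reads),
    }


# Hypothetical toy input: positions and sequences are made up for illustration.
site = encode_site(
    tumor_reads=[(98, "ACGTA"), (100, "TTT")],
    normal_reads=[(98, "ACGTA")],
    center=100,
)
```

Each row is one read and each column one genomic position, so a 2D convolution over such a grid can pick up patterns spanning neighboring reads and bases. A real encoder would also need to handle indels, base and mapping qualities, and strand, which this sketch omits.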