Park Jimin, Cook Daniel E, Chang Pi-Chuan, Kolesnikov Alexey, Brambrink Lucas, Mier Juan Carlos, Gardner Joshua, McNulty Brandy, Sacco Samuel, Keskus Ayse, Bryant Asher, Ahmad Tanveer, Shetty Jyoti, Zhao Yongmei, Tran Bao, Narzisi Giuseppe, Helland Adrienne, Yoo Byunggil, Pushel Irina, Lansdon Lisa A, Bi Chengpeng, Walter Adam, Gibson Margaret, Pastinen Tomi, Farooqi Midhat S, Robine Nicolas, Miga Karen H, Carroll Andrew, Kolmogorov Mikhail, Paten Benedict, Shafin Kishwar
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
Google Inc, Mountain View, CA, USA.
bioRxiv. 2024 Aug 19:2024.08.16.608331. doi: 10.1101/2024.08.16.608331.
Somatic variant detection is an integral part of cancer genomics analysis. While most methods have focused on short-read sequencing, long-read technologies now offer potential advantages in repeat mapping and variant phasing. We present DeepSomatic, a deep learning method for detecting somatic single-nucleotide variants (SNVs) and insertions and deletions (indels) from both short-read and long-read data. It provides modes for whole-genome and exome sequencing and can run on tumor-normal, tumor-only, and formalin-fixed, paraffin-embedded (FFPE) samples. To help address the dearth of publicly available training and benchmarking data for somatic variant detection, we generated and made openly available a dataset of five matched tumor-normal cell line pairs sequenced with Illumina, PacBio HiFi, and Oxford Nanopore Technologies, along with benchmark variant sets. Across samples and technologies (short-read and long-read), DeepSomatic consistently outperforms existing callers, particularly for indels.