Life Science Research and Foundation, QIAGEN Sciences Inc., Frederick, MD, USA.
Bioinformatics. 2019 Apr 15;35(8):1299-1309. doi: 10.1093/bioinformatics/bty790.
Low-frequency DNA mutations are often confounded with technical artifacts from sample preparation and sequencing. With unique molecular identifiers (UMIs), most sequencing errors can be corrected. However, errors introduced before UMI tagging, such as DNA polymerase errors during end repair and the first PCR cycle, cannot be corrected with single-strand UMIs and impose fundamental limits on UMI-based variant calling.
We developed smCounter2, a UMI-based variant caller for targeted sequencing data and an upgrade of the existing smCounter. Compared to smCounter, smCounter2 features a lower detection limit (0.5% allele frequency, down from 1%), better overall accuracy (particularly in non-coding regions), a consistent threshold that can be applied to both deep and shallow sequencing runs, and easier use via a Docker image and code for read pre-processing. We benchmarked smCounter2 against several state-of-the-art UMI-based variant calling methods using multiple datasets and demonstrated smCounter2's superior performance in detecting somatic variants. At the core of smCounter2 is a statistical test that determines whether the allele frequency of a putative variant is significantly above the background error rate; the background error rate was carefully modeled using an independent dataset. The improved accuracy in non-coding regions was mainly achieved using novel repetitive region filters that were specifically designed for UMI data.
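The idea of testing a variant's allele frequency against a background error rate can be illustrated with a minimal sketch. This is not smCounter2's actual model: it assumes a single fixed background error rate and a one-sided binomial test on UMI counts, whereas the abstract only states that the background error rate was modeled from an independent dataset. The function names, the `background_rate` value, and the `alpha` threshold are all hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large UMI depths do not
    overflow when the binomial coefficient is evaluated.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


def is_above_background(alt_umis, total_umis, background_rate, alpha):
    """Hypothetical decision rule: flag the site if the chance of seeing at
    least `alt_umis` variant-supporting UMIs from background error alone
    is below `alpha`.

    ASSUMPTION: `background_rate` is one fixed per-UMI error probability;
    the published method models the background from an independent dataset.
    """
    p_value = binomial_upper_tail(alt_umis, total_umis, background_rate)
    return p_value < alpha


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(is_above_background(20, 2000, 1e-3, 1e-6))
    print(is_above_background(2, 2000, 1e-3, 1e-2))
```

Under these assumed numbers, 20 variant UMIs out of 2000 (1% allele frequency) against a 0.1% background would be flagged, while 2 out of 2000 would not, since two errors are expected on average from the background alone.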
The entire pipeline is available at https://github.com/qiaseq/qiaseq-dna under the MIT license.
Supplementary data are available at Bioinformatics online.