Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA.
Department of Computer Science, University of Memphis, Memphis, TN, USA.
Genome Biol. 2021 Jan 25;22(1):37. doi: 10.1186/s13059-020-02254-2.
There is currently no method to precisely measure the errors that occur in the sequencing instrument (the sequencer). Such a measurement is critical for next-generation sequencing applications aimed at discovering the genetic makeup of heterogeneous cellular populations.
We propose a novel computational method, SequencErr, to address this challenge by measuring the base correspondence between overlapping regions in forward and reverse reads. An analysis of 3777 public datasets from 75 research institutions in 18 countries revealed the sequencer error rate to be ~10 per million (pm), and that 1.4% of sequencers and 2.7% of flow cells have error rates >100 pm. At the flow cell level, error rates are elevated on the bottom surfaces, and >90% of HiSeq and NovaSeq flow cells have at least one outlier error-prone tile. By sequencing a common DNA library on different sequencers, we demonstrate that sequencers with high error rates have reduced overall sequencing accuracy, and that removal of outlier error-prone tiles improves sequencing accuracy. We demonstrate that SequencErr can reveal novel insights relative to the popular quality control method FastQC and achieve a 10-fold lower error rate than popular error correction methods including Lighter and Musket.
Our study reveals novel insights into the nature of DNA sequencing errors incurred on DNA sequencers. Our method can be used to assess, calibrate, and monitor sequencer accuracy, and to computationally suppress sequencer errors in existing datasets.
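The overlap comparison described above can be illustrated with a minimal sketch. This is not the SequencErr implementation: the function names, the assumption that the overlap length is already known (for example, from the aligned fragment size), and the choice to skip N bases are all hypothetical and chosen only for illustration. The idea is that when a fragment is shorter than twice the read length, the tail of the forward read and the head of the reverse-complemented reverse read cover the same bases, so any disagreement there means at least one of the two base calls is wrong.

```python
# Illustrative sketch only; names and conventions here are assumptions,
# not the published SequencErr code.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(forward: str, reverse: str, overlap_len: int) -> tuple[int, int]:
    """Count disagreements in the region covered by both reads of a pair.

    Assumes equal-length reads and a known overlap length. The last
    `overlap_len` bases of the forward read are compared with the first
    `overlap_len` bases of the reverse-complemented reverse read.
    Positions where either read has an N are skipped.
    Returns (mismatches, positions_compared).
    """
    if overlap_len <= 0:
        return 0, 0
    fwd_part = forward[-overlap_len:]
    rev_part = reverse_complement(reverse)[:overlap_len]
    mismatches = 0
    compared = 0
    for a, b in zip(fwd_part, rev_part):
        if a == "N" or b == "N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches, compared


def errors_per_million(mismatches: int, compared: int) -> float:
    """Express a mismatch count as a rate per million compared bases."""
    return 1e6 * mismatches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy 10-base fragment sequenced with 6-base reads, so 2 bases overlap.
    fragment = "ACGTTGCAAT"
    read1 = fragment[:6]
    read2 = reverse_complement(fragment[-6:])
    print(overlap_mismatches(read1, read2, 2))
```

In practice the counts would be accumulated over many read pairs (and could be grouped by sequencer, flow cell, or tile) before converting to a per-million rate; a single pair carries far too few bases to estimate a rate on the order of 10 pm.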