Stoler Nicholas, Nekrutenko Anton
Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA.
NAR Genom Bioinform. 2021 Mar 27;3(1):lqab019. doi: 10.1093/nargab/lqab019. eCollection 2021 Mar.
Sequencing technology has advanced greatly over the past decade. Previous studies have characterized the quality of specific instruments under controlled conditions. Here, we developed a method that can retroactively determine the error rate of most public sequencing datasets. To do this, we used the overlaps between paired reads that are a feature of many sequencing libraries. With this method, we surveyed 1943 datasets from seven different sequencing instruments produced by Illumina. We show that among public datasets, the more expensive platforms such as HiSeq and NovaSeq have a lower error rate and less variation. However, we also found great variation within each platform, with the accuracy of a sequencing experiment depending greatly on the experimenter. We show the importance of sequence context, especially the phenomenon in which preceding bases bias the following bases toward the same identity. We also show how patterns of sequence bias differ between instruments. Contrary to expectations based on the underlying chemistry, HiSeq X Ten and NovaSeq 6000 share notable exceptions to the preceding-base bias. Our results demonstrate the importance of the specific circumstances of every sequencing experiment, and of evaluating the quality of each one.
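The overlap idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes a read pair in which the second read is sequenced from the opposite strand, and it assumes the fragment length is already known (in practice it would come from an alignment). The names `reverse_complement`, `overlap_mismatches`, and `fragment_length` are hypothetical and chosen only for this example. Wherever the two reads cover the same fragment position and disagree, at least one of them contains an error.

```python
# Illustrative sketch only: counts disagreements between the two reads of a
# pair where they cover the same bases. Function and parameter names are
# hypothetical, not taken from the paper.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(read1, read2, fragment_length):
    """Compare a read pair over the region where the two reads overlap.

    read1 is assumed to start at fragment position 0 on the forward strand.
    read2 is assumed to start at the far end of the fragment on the reverse
    strand, so its reverse complement ends at position fragment_length.
    Returns (mismatches, compared_positions).
    """
    mate = reverse_complement(read2)
    # Fragment coordinate at which the reoriented mate begins
    # (negative if the mate is longer than the fragment).
    mate_start = fragment_length - len(mate)
    overlap_start = max(mate_start, 0)
    overlap_end = min(len(read1), fragment_length)

    mismatches = 0
    compared = 0
    for pos in range(overlap_start, overlap_end):
        base1 = read1[pos]
        base2 = mate[pos - mate_start]
        if base1 == "N" or base2 == "N":
            continue  # skip positions where either read has no base call
        compared += 1
        if base1 != base2:
            mismatches += 1
    return mismatches, compared
```

Dividing the mismatch count by the number of compared positions, summed over many read pairs, gives a rough per-base disagreement rate. A disagreement alone does not reveal which of the two reads is wrong, so any real analysis would need further steps that this sketch does not attempt.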