Tiburon, CA 94920, USA and.
Department of Micro- and Nanotechnology, Technical University of Denmark, DK-2800 Lyngby, Denmark.
Bioinformatics. 2015 Nov 1;31(21):3476-82. doi: 10.1093/bioinformatics/btv401. Epub 2015 Jul 2.
Next-generation sequencing produces vast amounts of data with errors that are difficult to distinguish from true biological variation when coverage is low.
We demonstrate large reductions in error frequencies, especially for high-error-rate reads, by three independent means: (i) filtering reads according to their expected number of errors, (ii) assembling overlapping read pairs and (iii) for amplicon reads, by exploiting unique sequence abundances to perform error correction. We also show that most published paired read assemblers calculate incorrect posterior quality scores.
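As an illustration of method (i), the sketch below computes the expected number of errors in a read as the sum of the per-base error probabilities implied by its Phred quality scores, then discards reads above a threshold. It assumes Phred+33 ASCII encoding. The function names and the `max_ee` threshold of 1.0 are illustrative choices, not taken from the abstract or from USEARCH itself.

```python
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities, p = 10^(-Q/10), over one read.

    Assumes Phred quality scores encoded as ASCII with the given offset.
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality_string)


def filter_reads(records, max_ee=1.0):
    """Keep (name, sequence, quality) records with expected errors <= max_ee.

    max_ee is an illustrative threshold, not a value prescribed by the source.
    """
    return [rec for rec in records if expected_errors(rec[2]) <= max_ee]


if __name__ == "__main__":
    reads = [
        ("good_read", "ACGT", "IIII"),                  # Q40 at every base
        ("poor_read", "ACGTACGTACGT", "++++++++++++"),  # Q10 at every base
    ]
    for name, _seq, qual in reads:
        print(name, round(expected_errors(qual), 4))
    print([rec[0] for rec in filter_reads(reads)])
```

Because the expected error count is a sum over bases, a long read with moderate qualities can be rejected while a short read with the same average quality passes, which is the behavior that distinguishes this filter from one based on mean quality.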
These methods are implemented in the USEARCH package. Binaries are freely available at http://drive5.com/usearch.
Supplementary data are available at Bioinformatics online.