Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan.
Nucleic Acids Res. 2010 Apr;38(7):e100. doi: 10.1093/nar/gkq010. Epub 2010 Jan 27.
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the proportion of correctly mapped reads from 49% to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
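The idea of folding per-base error probabilities into alignment scores can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Phred-scaled quality values (error probability 10^(-Q/10)), uniform base frequencies, errors and substitutions spread evenly over the three alternative bases, and a single hypothetical `divergence` parameter for the rate of real sequence difference. All function names are illustrative. The score for aligning a reference base to an observed read base is the log-odds of the pair under a model that sums over the unknown true read base, against independent background.

```python
import math

BASES = "ACGT"

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def observed_given_true(observed, true, err):
    """P(observed base | true base), assuming errors are spread evenly
    over the three other bases (an assumption of this sketch)."""
    return 1 - err if observed == true else err / 3

def pair_prob(ref, true, divergence):
    """Joint probability of (reference base, true read base), assuming
    uniform base frequencies and uniform substitutions."""
    return 0.25 * ((1 - divergence) if ref == true else divergence / 3)

def quality_aware_score(ref, observed, q, divergence=0.01):
    """Log-odds score (in bits) for aligning `ref` to `observed`,
    marginalising over the unknown true read base."""
    err = phred_to_error_prob(q)
    joint = sum(
        pair_prob(ref, t, divergence) * observed_given_true(observed, t, err)
        for t in BASES
    )
    # With uniform frequencies, P(ref) * P(observed) = 0.25 * 0.25.
    background = 0.25 * 0.25
    return math.log2(joint / background)

if __name__ == "__main__":
    for q in (2, 20, 40):
        match = quality_aware_score("A", "A", q)
        mismatch = quality_aware_score("A", "C", q)
        print(f"Q={q:2d}  match={match:+.3f}  mismatch={mismatch:+.3f}")
```

Under these assumptions, a mismatch at a low-quality position is penalised far less than one at a high-quality position, because it is more plausibly a sequencer error than a real difference. This is the behaviour the abstract describes as modelling both error sources at once.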