Schbath Sophie, Martin Véronique, Zytnicki Matthias, Fayolle Julien, Loux Valentin, Gibrat Jean-François
INRA, UR1077 Unité Mathématique Informatique et Génome, Jouy-en-Josas, France.
J Comput Biol. 2012 Jun;19(6):796-813. doi: 10.1089/cmb.2012.0022. Epub 2012 Apr 16.
Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developed in the last 3 years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well-controlled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome and contain no mismatches, and a set of reads with three mismatches. As reference genomes, we used both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads uniquely reported and to reads with multiple hits.
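To make the benchmark idea concrete, the following is a minimal Python sketch, not the authors' actual pipeline. It assumes substitution-only mismatches and a naive linear scan in place of the indexed algorithms the article reviews. The function names (`mutate`, `occurrences`, `recall`) and the toy genome are hypothetical and chosen only for illustration.

```python
import random


def mutate(read, k, rng):
    """Return a copy of `read` with exactly k substituted positions."""
    positions = rng.sample(range(len(read)), k)
    bases = list(read)
    for p in positions:
        # Pick a base different from the original so each position truly mismatches.
        bases[p] = rng.choice([b for b in "ACGT" if b != bases[p]])
    return "".join(bases)


def occurrences(genome, read, max_mismatches=0):
    """Naive scan: 0-based starts where `read` matches with at most
    `max_mismatches` substitutions (no insertions or deletions)."""
    hits = []
    n = len(read)
    for start in range(len(genome) - n + 1):
        mismatches = 0
        for a, b in zip(genome[start:start + n], read):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits


def recall(expected, reported):
    """Fraction of the true occurrences that a tool reported."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(reported)) / len(expected)


# Toy data (hypothetical): the read occurs exactly twice in the genome.
genome = "ACGTACGTTTACGTACGA"
read = "ACGTACG"
truth = occurrences(genome, read)                    # exact occurrences
noisy_read = mutate(read, 3, random.Random(0))       # read with three mismatches
```

In this sketch, `truth` plays the role of the known set of occurrences, and `recall(truth, reported)` would score the positions reported by a mapping tool. Comparing `len(truth) == 1` against `len(truth) > 1` separates reads expected to map uniquely from reads with multiple hits.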