Menzel Peter, Frellsen Jes, Plass Mireya, Rasmussen Simon H, Krogh Anders
Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark.
Methods Mol Biol. 2013;1038:39-59. doi: 10.1007/978-1-62703-514-9_3.
The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. A single experiment produces millions of reads. To gain knowledge from these experiments, the first step is to find the genomic origin of the reads, i.e., to map the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and explain why mapping quality should be evaluated in a probabilistic manner. We close by discussing the potential for improving the performance of current methods by considering these quality scores in a probabilistic mapping program.
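The abstract mentions the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length, but does not state the formulas. The following is a minimal sketch of how such quantities could be estimated, not the chapter's own derivation. It assumes exact matches only, independent and uniformly distributed bases (each with probability 1/4), and a Poisson approximation for the probability of at least one match. The function names and the 3 Gb genome size are hypothetical choices made for illustration.

```python
import math


def expected_random_matches(read_length, genome_size, both_strands=True):
    """Expected number of exact random matches of one read (an E-value).

    Assumption: bases are i.i.d. and uniform, so a read of length L matches
    a given genome position with probability (1/4)**L.
    """
    positions = genome_size - read_length + 1  # possible start positions
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1  # forward and reverse strand
    return strands * positions * 0.25 ** read_length


def prob_random_match(read_length, genome_size, both_strands=True):
    """Probability of at least one random match, via a Poisson approximation.

    P(at least one match) is approximated by 1 - exp(-E).
    """
    e_value = expected_random_matches(read_length, genome_size, both_strands)
    return -math.expm1(-e_value)  # numerically stable 1 - exp(-E)


# Illustrative genome size (roughly human-sized); an assumption, not a value
# taken from the chapter.
GENOME_SIZE = 3_000_000_000

for length in (10, 15, 20, 25, 30):
    e_value = expected_random_matches(length, GENOME_SIZE)
    p_match = prob_random_match(length, GENOME_SIZE)
    print(f"L={length:2d}  E={e_value:.3e}  P(random match)={p_match:.3e}")
```

Under these assumptions the E-value falls by a factor of four with every additional base, which is why very short reads are expected to match a large genome by chance while longer reads are not. Allowing mismatches, or using a non-uniform base composition, would increase the expected number of random matches relative to this sketch.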