Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen, Denmark.
BMC Bioinformatics. 2014 Apr 9;15:100. doi: 10.1186/1471-2105-15-100.
Modern DNA sequencing methods produce vast amounts of data that often require mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach lacks a statistical foundation and can, for some data types, result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases.
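To make the idea of a read-derived position-specific scoring matrix concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a Phred quality Q corresponds to an error probability of 10^(-Q/10), that an erroneous call is equally likely to be any of the three other bases, and that the background base frequency is uniform. The function names (`read_to_pssm`, `score_window`, `best_hit`) are hypothetical, and only the bases A, C, G and T are handled.

```python
import math

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def read_to_pssm(read, quals, background=0.25):
    """Build one log-odds column per read position.

    Assumption: a miscalled base is uniformly one of the other three,
    and the background frequency is the same for every base.
    """
    pssm = []
    for base, q in zip(read, quals):
        e = phred_to_error(q)
        column = {}
        for g in BASES:
            # Probability of observing `base` if the genome holds `g`.
            p = 1 - e if g == base else e / 3
            column[g] = math.log2(p / background)
        pssm.append(column)
    return pssm

def score_window(pssm, window):
    """Sum the column scores for a genome window of the same length."""
    return sum(col[g] for col, g in zip(pssm, window))

def best_hit(pssm, reference):
    """Exhaustively score every window; return (score, position)."""
    n = len(pssm)
    scores = [
        (score_window(pssm, reference[i:i + n]), i)
        for i in range(len(reference) - n + 1)
    ]
    return max(scores)

if __name__ == "__main__":
    pssm = read_to_pssm("ACGT", [30, 30, 30, 30])
    print(best_hit(pssm, "TTACGTTT"))
```

Under these assumptions a mismatch at a low-quality position costs less than one at a high-quality position, which is the behaviour a plain mismatch count cannot express. A model of evolution or a data-specific bias would enter by replacing the uniform `e / 3` term with position-dependent substitution probabilities.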
We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. When the biases of these data are modeled, our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT-rich organism P. falciparum. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short contaminant reads and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans).
The presented work is an implementation of a novel approach to short read mapping in which quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low-quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability-based alignment method on real and simulated data sets.
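As an illustration of what a probabilistic mapping quality can look like, the sketch below turns the log2 scores of all candidate locations for one read into posterior probabilities and then into a Phred-scaled value. It assumes equal prior probability for every candidate location and uses the common convention of -10 log10 of the probability that the mapping is wrong; the abstract does not state the exact formula used by the method, and the function names and the cap of 60 are hypothetical.

```python
import math

def mapping_posteriors(log2_scores):
    """Posterior probability of each candidate location.

    Assumption: equal priors, and scores that are log2 likelihoods
    up to a constant shared by all candidates.
    """
    m = max(log2_scores)  # subtract the maximum for numerical stability
    weights = [2 ** (s - m) for s in log2_scores]
    total = sum(weights)
    return [w / total for w in weights]

def phred_mapq(posterior, cap=60):
    """Phred-scale the probability that the chosen location is wrong."""
    p_wrong = 1 - posterior
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

if __name__ == "__main__":
    for scores in ([10.0, 10.0], [12.0, 4.0, 3.5]):
        posts = mapping_posteriors(scores)
        print(scores, [phred_mapq(p) for p in posts])
```

A read with two equally good candidate locations receives a posterior of 0.5 for each and therefore a low mapping quality, whereas a read with one clearly dominant location approaches the cap.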