Li Heng, Ruan Jue, Durbin Richard
The Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom.
Genome Res. 2008 Nov;18(11):1851-8. doi: 10.1101/gr.078212.108. Epub 2008 Aug 19.
New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
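To make the idea of mapping quality concrete, here is a minimal sketch that turns the probability of a wrong placement into a Phred-scaled score, Q = -10 log10(P(wrong)). It is an illustration under stated assumptions, not MAQ's actual implementation. It assumes a flat prior over candidate positions, and that the likelihood of a read at a position is 10^(-s/10), where s is the sum of the base qualities at mismatched bases. The function name `mapping_quality` and the cap of 99 are hypothetical choices made for this example.

```python
import math


def mapping_quality(mismatch_quality_sums, cap=99):
    """Phred-scaled probability that the best candidate position is wrong.

    mismatch_quality_sums: one value per candidate position, each the sum of
    Phred base qualities at the bases that mismatch the reference there.
    Assumes a flat prior over candidates (an assumption of this sketch).
    """
    # Likelihood of the read at each candidate: a mismatched base with
    # Phred quality q contributes a factor of 10^(-q/10).
    likelihoods = [10 ** (-s / 10) for s in mismatch_quality_sums]

    # Posterior probability of the best candidate under the flat prior.
    posterior_best = max(likelihoods) / sum(likelihoods)

    p_wrong = 1.0 - posterior_best
    if p_wrong <= 0.0:
        # A single candidate leaves no alternative; report the cap.
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


if __name__ == "__main__":
    print(mapping_quality([0]))       # one perfect hit
    print(mapping_quality([20, 20]))  # two equally good hits
    print(mapping_quality([0, 30]))   # perfect hit plus a much weaker one
```

Two equally good candidates give a score near 3 (a 50% chance of error), while a perfect hit competing with a hit that has one high-quality mismatch gives a score of about 30. A repeat-induced ambiguity therefore shows up directly as a low mapping quality.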