Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Nucleic Acids Res. 2011 Aug;39(15):6359-68. doi: 10.1093/nar/gkr334. Epub 2011 May 16.
Multiple sequence alignment, which is of fundamental importance for comparative genomics, is a difficult and error-prone problem. It is therefore essential to measure the reliability of alignments and to incorporate that reliability into downstream analyses. We propose a new probabilistic sampling-based alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between alignment quality and guide tree uncertainty in progressive alignment methods, we directly generate suboptimal alignments from an input multiple sequence alignment by a probabilistic sampling method, and compute the agreement of the input alignment with the suboptimal alignments as the alignment reliability score. We construct the suboptimal alignments by an approximate method based on pairwise comparisons between each single sequence and the sub-alignment of the input alignment from which that sequence has been left out. Using simulation-based benchmarks, we find that our approach is superior to existing ones, which supports the view that suboptimal alignments are a highly informative source for assessing alignment reliability. We apply the PSAR method to the alignments in the UCSC Genome Browser to measure the reliability of alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation studies.
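The agreement step described above can be illustrated with a minimal sketch. It is not the PSAR implementation: the abstract does not specify how agreement is computed, so the sketch assumes a simple per-column measure, namely the fraction of residue pairs aligned in an input column that are also aligned in each sampled alignment, averaged over the samples. The sampling procedure itself is not reproduced; the sampled alignments are taken as given, and the function names `aligned_pairs_by_column` and `column_agreement` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def aligned_pairs_by_column(alignment):
    """Return, for each column, the set of residue pairs aligned in it.

    `alignment` is a list of equal-length gapped strings. A pair is
    (seq_i, pos_i, seq_j, pos_j) with seq_i < seq_j, where positions
    are 0-based indices into the ungapped sequences.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    next_pos = [0] * len(alignment)  # next ungapped position per sequence
    columns = []
    for c in range(n_cols):
        residues = []
        for s, row in enumerate(alignment):
            if row[c] != GAP:
                residues.append((s, next_pos[s]))
                next_pos[s] += 1
        columns.append(
            {(a[0], a[1], b[0], b[1]) for a, b in combinations(residues, 2)}
        )
    return columns


def column_agreement(input_alignment, sampled_alignments):
    """Assumed agreement measure (not taken from the paper).

    For each input column, return the fraction of its aligned residue
    pairs that also occur in a sampled alignment, averaged over all
    samples. Columns with no aligned pair (or no samples) yield None.
    """
    input_cols = aligned_pairs_by_column(input_alignment)
    sample_pairs = [
        set().union(*aligned_pairs_by_column(s)) for s in sampled_alignments
    ]
    scores = []
    for pairs in input_cols:
        if not pairs or not sample_pairs:
            scores.append(None)
            continue
        hits = sum(len(pairs & sp) for sp in sample_pairs)
        scores.append(hits / (len(pairs) * len(sample_pairs)))
    return scores


# Toy input: two sequences (ACGT and ACAGT) and two hypothetical samples.
input_aln = ["AC-GT", "ACAGT"]
samples = [["AC-GT", "ACAGT"], ["A-CGT", "ACAGT"]]
scores = column_agreement(input_aln, samples)
print(scores)
```

In this toy case the second column is supported by only one of the two samples, so it receives a lower score than its neighbours, while the gap-only column has no aligned pair and is left unscored.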