Department of Information Technology, Ghent University, Ghent, Belgium.
Bioinformatics. 2011 Mar 15;27(6):749-56. doi: 10.1093/bioinformatics/btr008. Epub 2011 Jan 6.
Many comparative genomics studies rely on the correct identification of homologous genomic regions using accurate alignment tools. In such cases, the alphabet of the input sequences consists of complete genes, rather than nucleotides or amino acids. As optimal multiple sequence alignment is computationally impractical, a progressive alignment strategy is often employed. However, such an approach is susceptible to the propagation of alignment errors made in early pairwise alignment steps, especially when dealing with strongly diverged genomic regions. In this article, we present a novel, accurate and efficient greedy graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists.
Based on provable properties of the graph structure, several heuristics are developed to resolve local alignment conflicts that occur due to gene duplication and/or rearrangement events on the different genomic segments. The performance of the algorithm is assessed by comparing the alignment results of homologous genomic segments in Arabidopsis thaliana to those obtained by using both a progressive alignment method and an earlier graph-based implementation. Especially for datasets that contain strongly diverged segments, the proposed method achieves a substantially higher alignment accuracy, and proves to be sufficiently fast for large datasets including a few dozen eukaryotic genomes.
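To make the notion of a local alignment conflict concrete, the following minimal sketch treats two segments as ordered gene lists and flags pairs of homology links that cannot both appear in one colinear alignment. It is an illustration under stated assumptions, not the algorithm implemented in i-ADHoRe 3.0: the function names, the `family` mapping and the gene identifiers are hypothetical, and only two segments are considered.

```python
from itertools import combinations


def homology_links(seg_a, seg_b, family):
    """Return (i, j) index pairs where seg_a[i] and seg_b[j] share a gene family.

    `family` is a hypothetical dict mapping each gene identifier to a family label.
    """
    return [(i, j)
            for i, gene_a in enumerate(seg_a)
            for j, gene_b in enumerate(seg_b)
            if family[gene_a] == family[gene_b]]


def conflicting_pairs(links):
    """Return pairs of links that cannot both be kept in one colinear alignment."""
    conflicts = []
    for (i1, j1), (i2, j2) in combinations(links, 2):
        shares_gene = i1 == i2 or j1 == j2    # duplication: one gene, two partners
        crosses = (i1 - i2) * (j1 - j2) < 0   # rearrangement: gene order is inverted
        if shares_gene or crosses:
            conflicts.append(((i1, j1), (i2, j2)))
    return conflicts


# Toy data (assumed for illustration): the last two genes are swapped on segment B.
seg_a = ["a1", "a2", "a3"]
seg_b = ["b1", "b2", "b3"]
family = {"a1": "F1", "a2": "F2", "a3": "F3",
          "b1": "F1", "b2": "F3", "b3": "F2"}

links = homology_links(seg_a, seg_b, family)
conflicts = conflicting_pairs(links)
```

In this toy case the links for families F2 and F3 cross, so at most one of them can be kept; choosing which links to drop is the kind of decision the heuristics mentioned above are meant to make.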
The algorithm is implemented as part of the i-ADHoRe 3.0 package, available at http://bioinformatics.psb.ugent.be/software.