Wang Shu, Gutell Robin R, Miranker Daniel P
Department of Electrical and Computer Engineering, School of Biological Sciences, University of Texas at Austin, Austin, TX 78712, USA.
Bioinformatics. 2007 Dec 15;23(24):3289-96. doi: 10.1093/bioinformatics/btm485. Epub 2007 Oct 6.
Biclustering is a clustering method that simultaneously clusters both the domain and the range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. At the same time, the grouping of the sequences can affect the alignment. This is precisely the kind of dual situation that biclustering is intended to address.
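To make the idea of clustering both sides of a relation concrete, the following minimal Java sketch treats the relation as a binary matrix and checks whether a chosen set of rows and columns forms a bicluster. It is an illustration of the general notion only, not BlockMSA's actual representation: the matrix layout (rows as sequences, columns as candidate conserved blocks), the all-ones criterion, and the names `BiclusterSketch` and `isBicluster` are assumptions introduced here.

```java
import java.util.List;

public class BiclusterSketch {

    /**
     * Returns true if every selected row is related to every selected column,
     * i.e. the chosen rows and columns form an all-true submatrix.
     * (Assumed criterion for illustration; real biclustering algorithms
     * usually tolerate noise and search for such submatrices themselves.)
     */
    public static boolean isBicluster(boolean[][] relation,
                                      List<Integer> rows,
                                      List<Integer> cols) {
        for (int r : rows) {
            for (int c : cols) {
                if (!relation[r][c]) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical toy relation: rows = sequences, columns = candidate blocks.
        boolean[][] relation = {
            {true,  true,  false},
            {true,  true,  false},
            {false, false, true},
        };
        // Sequences 0 and 1 share blocks 0 and 1, so they form a bicluster.
        System.out.println(isBicluster(relation, List.of(0, 1), List.of(0, 1)));
        // Sequence 2 lacks blocks 0 and 1, so this selection does not.
        System.out.println(isBicluster(relation, List.of(0, 2), List.of(0, 1)));
    }
}
```

In this toy setting, grouping the sequences (rows) and grouping the shared blocks (columns) are decided together, which mirrors the dual situation described above.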
We define a representation of the MSA problem that enables the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension of that suite to larger problem sizes. Alignments of two large datasets of current biological interest, T box sequences and Group IC1 introns, were also evaluated. The results were compared with alignments computed by the ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using the Sum of Pairs Score (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or more sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions.
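As a rough illustration of how a Sum of Pairs Score can be computed, the sketch below uses the commonly used definition: the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. This is not the scoring code used in the study. The class and method names are hypothetical, and the sketch assumes that both alignments list the same sequences in the same order, that all rows within an alignment have equal length, and that `-` is the only gap character.

```java
import java.util.HashSet;
import java.util.Set;

public class SumOfPairsSketch {

    /** Collects every pair of residues that share a column, keyed by sequence and residue index. */
    public static Set<String> alignedPairs(String[] alignment) {
        int n = alignment.length;
        int columns = alignment[0].length();
        int[] residueIndex = new int[n]; // ungapped position reached in each sequence
        Set<String> pairs = new HashSet<>();
        for (int col = 0; col < columns; col++) {
            for (int i = 0; i < n; i++) {
                if (alignment[i].charAt(col) == '-') continue;
                for (int j = i + 1; j < n; j++) {
                    if (alignment[j].charAt(col) == '-') continue;
                    pairs.add(i + ":" + residueIndex[i] + "|" + j + ":" + residueIndex[j]);
                }
            }
            // Advance the residue counters only after the whole column is processed.
            for (int i = 0; i < n; i++) {
                if (alignment[i].charAt(col) != '-') residueIndex[i]++;
            }
        }
        return pairs;
    }

    /** Fraction of reference residue pairs that the test alignment reproduces. */
    public static double sps(String[] reference, String[] test) {
        Set<String> referencePairs = alignedPairs(reference);
        if (referencePairs.isEmpty()) return 0.0;
        Set<String> testPairs = alignedPairs(test);
        int shared = 0;
        for (String pair : referencePairs) {
            if (testPairs.contains(pair)) shared++;
        }
        return (double) shared / referencePairs.size();
    }

    public static void main(String[] args) {
        String[] reference = {"ACGU", "ACGU"};
        String[] test = {"ACGU-", "AC-GU"}; // only the first two residue pairs are preserved
        System.out.println(sps(reference, test));
    }
}
```

A score of 1.0 means the test alignment reproduces every residue pair of the reference, and 0.0 means it reproduces none.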
BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/