Thompson Julie D, Koehl Patrice, Ripp Raymond, Poch Olivier
Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP), Illkirch Cedex, France.
Proteins. 2005 Oct 1;61(1):127-36. doi: 10.1002/prot.20527.
Multiple sequence alignment is one of the cornerstones of modern molecular biology. It is used to identify conserved motifs, to determine protein domains, in 2D/3D structure prediction by homology and in evolutionary studies. Recently, high-throughput technologies such as genome sequencing and structural proteomics have led to an explosion in the amount of sequence and structure information available. In response, several new multiple alignment methods have been developed that improve both the efficiency and the quality of protein alignments. Consequently, the benchmarks used to evaluate and compare these methods must also evolve. We present here the latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high-quality, manually refined reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences. Using a novel, semiautomatic update protocol, the number of protein families in the benchmark has been increased and representative test cases are now available that cover most of the protein fold space. The total number of proteins in BAliBASE has also been significantly increased from 1444 to 6255 sequences. In addition, full-length sequences are now provided for all test cases, which represent difficult cases for both global and local alignment programs. Finally, the BAliBASE Web site (http://www-bio3d-igbmc.u-strasbg.fr/balibase) has been completely redesigned to provide a more user-friendly, interactive interface for the visualization of the BAliBASE reference alignments and the associated annotations.