Institut Pasteur, Unité de Biologie Moléculaire du Gène chez Extrêmophiles, Département de Microbiologie, 25 rue du Dr Roux, 75015 Paris, France.
BMC Evol Biol. 2010 Jul 13;10:210. doi: 10.1186/1471-2148-10-210.
The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. Removing ambiguously aligned regions, as well as other sources of bias such as highly variable (saturated) characters, has been shown to improve the overall performance of many phylogenetic reconstruction methods. A current trend is to build phylogenetic trees from large numbers of sequence datasets extracted (semi-)automatically from numerous complete genomes. Because these approaches do not allow precise manual curation of each dataset, there is a real need for efficient bioinformatic tools dedicated to this alignment character trimming step.
Here we present BMGE (Block Mapping and Gathering with Entropy), a new program designed to select the regions of a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. These entropy-like scores are weighted with BLOSUM or PAM similarity matrices in order to distinguish between biologically expected and unexpected variability at each aligned character. Sets of contiguous characters with a score above a given threshold are considered unsuited for phylogenetic inference and are removed. Simulation analyses show that the character trimming performed by BMGE produces datasets that lead to accurate trees, especially for alignments that include distantly related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity.
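The general idea of entropy-based character trimming can be illustrated with a minimal sketch. It is not BMGE's actual scoring: it uses plain Shannon entropy per alignment column, with no BLOSUM or PAM weighting, and the function names (`column_entropy`, `trim_alignment`), the gap symbols, and the default threshold are assumptions made for illustration only.

```python
import math
from collections import Counter

# Assumption: these symbols mark gaps or missing data and are ignored when scoring.
GAP_CHARS = set("-?.")


def column_entropy(column):
    """Shannon entropy (in bits) of the non-gap characters in one alignment column."""
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    if not residues:
        # A gap-only column carries no variability signal; it scores 0 here.
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def trim_alignment(sequences, threshold=1.0):
    """Remove every column whose entropy exceeds the threshold.

    Returns the trimmed sequences and the 0-based indices of the kept columns.
    Plain Shannon entropy stands in for the similarity-weighted score
    described in the text; the threshold value is an arbitrary assumption.
    """
    if not sequences:
        return [], []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*sequences) iterates over the alignment column by column.
    scores = [column_entropy(col) for col in zip(*sequences)]
    kept = [i for i, score in enumerate(scores) if score <= threshold]
    trimmed = ["".join(seq[i] for i in kept) for seq in sequences]
    return trimmed, kept


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "ACTC", "ACAG"]
    trimmed, kept = trim_alignment(alignment, threshold=1.0)
    print(kept)
    print(trimmed)
```

In this toy alignment the first two columns are invariant (entropy 0) and are kept, while the last two columns score 1.5 and 2.0 bits and are removed. An unweighted score of this kind treats every substitution as equally surprising, which is the limitation that weighting with a similarity matrix is meant to address.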
BMGE performs biologically relevant trimming on multiple alignments of DNA, codon or amino acid sequences. The Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.