Ninio Matan, Privman Eyal, Pupko Tal, Friedman Nir
The Selim and Rachel Benin School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel.
Bioinformatics. 2007 Jan 15;23(2):e136-41. doi: 10.1093/bioinformatics/btl304.
Distance-based methods for phylogeny reconstruction are the fastest and easiest to use, and their popularity is accordingly high. They are also the only known methods that can cope with huge datasets of thousands of sequences. These methods rely on evolutionary distance estimation and are sensitive to errors in such estimations. In this study, a novel Bayesian method for estimation of evolutionary distances is developed. The proposed method enables the use of a sophisticated evolutionary model that better accounts for among-site rate variation (ASRV), thereby improving the accuracy of distance estimation. Rate variations are estimated within a Bayesian framework by extracting information from the entire dataset of sequences, unlike standard methods that can only use one pair of sequences at a time. We compare the accuracy of a cascade of distance estimation methods, starting from commonly used methods and moving towards the more sophisticated novel method. Simulation studies show significant improvements in the accuracy of distance estimation by the novel method over the commonly used ones. We demonstrate the effect of the improved accuracy on tree reconstruction using both real and simulated protein sequence alignments. An implementation of this method is available as part of the SEMPHY package.
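To make the role of among-site rate variation concrete, the following minimal sketch contrasts two textbook pairwise estimators for protein sequences: the Poisson-corrected distance, which assumes every site evolves at the same rate, and its gamma-corrected counterpart, which assumes rates follow a gamma distribution with shape parameter `alpha`. This is not the Bayesian method described in the abstract and not SEMPHY code. It illustrates the standard one-pair-at-a-time approach that the abstract contrasts against, with `alpha` supplied by hand rather than estimated from the whole dataset. All function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(p, n_states=20):
    """Poisson-corrected distance: all sites share one rate (no ASRV)."""
    b = (n_states - 1) / n_states
    if p >= b:
        raise ValueError("sequences are saturated; distance is undefined")
    return -b * math.log(1.0 - p / b)


def gamma_distance(p, alpha, n_states=20):
    """Gamma-corrected distance: site rates ~ Gamma(shape=alpha, mean=1).

    alpha is an assumed, user-supplied value here; small alpha means
    strong rate variation among sites.
    """
    b = (n_states - 1) / n_states
    if p >= b:
        raise ValueError("sequences are saturated; distance is undefined")
    return b * alpha * ((1.0 - p / b) ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    p = p_distance("ACDEFGHIKLMNPQ", "ACDEYGHIRLMNAQ")
    print("p-distance:", p)
    print("Poisson distance:", poisson_distance(p))
    print("Gamma distance (alpha=0.5):", gamma_distance(p, alpha=0.5))
```

For the same observed proportion of differences, the gamma-corrected estimate is never smaller than the Poisson-corrected one, and the gap widens as `alpha` decreases. Ignoring rate variation therefore tends to underestimate large distances, which is the kind of error that distance-based tree reconstruction is sensitive to.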