Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America.
PLoS One. 2010 Nov 15;5(11):e13999. doi: 10.1371/journal.pone.0013999.
Phylogenetic study of protein sequences provides unique and valuable insights into the molecular and genetic basis of important medical and epidemiological problems, as well as into the origins and development of physiological features in present-day organisms. Consensus phylogenies based on the bootstrap and other resampling methods play a crucial part in assessing the robustness of the trees produced in these analyses.
Our focus was to increase the number of bootstrap replications that can be performed on large protein datasets using the maximum parsimony, distance matrix, and maximum likelihood methods. We modified the PHYLIP package using MPI so that large-scale phylogenetic studies of protein sequences, with a statistically robust number of bootstrapped datasets, can be performed in a moderate amount of time. This paper discusses the methodology used to parallelize the PHYLIP programs and reports, for several protein datasets, the performance of the parallel PHYLIP programs that are relevant to the study of protein evolution.
Calculations that currently take a few days on a state-of-the-art desktop workstation can be completed over lunchtime on a modern parallel computer. Of the three protein methods tested, the maximum likelihood method scales best, followed by the distance method and then the maximum parsimony method. However, the maximum likelihood method requires significant memory resources, which limits its application to more moderately sized protein datasets.
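As a rough illustration of why bootstrap analyses lend themselves to parallel execution, the sketch below resamples alignment columns with replacement and divides the replicates among worker ranks. It is not taken from the paper or from PHYLIP: the function names, the round-robin assignment, and the toy alignment are all assumptions made for illustration, and a plain loop over rank numbers stands in for separate MPI processes.

```python
import random


def bootstrap_columns(n_sites, seed):
    """Draw n_sites column indices with replacement (one bootstrap replicate).

    Seeding by replicate number is an assumption made here so that each
    replicate is reproducible regardless of which rank computes it.
    """
    rng = random.Random(seed)
    return [rng.randrange(n_sites) for _ in range(n_sites)]


def resample_alignment(alignment, columns):
    """Build a pseudo-alignment by picking the given columns from every sequence."""
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def replicates_for_rank(n_replicates, rank, n_ranks):
    """Hypothetical round-robin split: rank r takes replicates r, r + n_ranks, ..."""
    return list(range(rank, n_replicates, n_ranks))


if __name__ == "__main__":
    # Toy protein alignment (made-up sequences, equal length).
    toy = {"taxonA": "MKVLAT", "taxonB": "MKILAS", "taxonC": "MRVLGT"}
    n_sites = len(toy["taxonA"])
    n_replicates, n_ranks = 10, 4

    # In an MPI program each rank would run only its own share;
    # here the outer loop simulates the ranks one after another.
    for rank in range(n_ranks):
        for rep in replicates_for_rank(n_replicates, rank, n_ranks):
            columns = bootstrap_columns(n_sites, seed=rep)
            pseudo = resample_alignment(toy, columns)
            print(rank, rep, pseudo["taxonA"])
```

Because each replicate depends only on the original alignment and its own random draw, the replicates can be processed independently; the tree-building step that would follow each resampling is omitted here.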