Price Morgan N, Dehal Paramvir S, Arkin Adam P
Physical Biosciences Division, Lawrence Berkeley National Laboratory, CA, USA.
Mol Biol Evol. 2009 Jul;26(7):1641-50. doi: 10.1093/molbev/msp077. Epub 2009 Apr 17.
Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement Neighbor-Joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest-neighbor interchanges to reduce the length of the tree. For an alignment with $N$ sequences, $L$ sites, and $a$ different characters, a distance matrix requires $O(N^2)$ space and $O(N^2 L)$ time, but FastTree requires just $O(NLa + N\sqrt{N})$ memory and $O(N\sqrt{N}\log(N)La)$ time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over using a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 h and 2.4 GB of memory. Just computing pairwise Jukes-Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 h and 50 GB of memory. In simulations, FastTree was slightly more accurate than Neighbor-Joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.
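To make the profile idea and the memory figures above concrete, the following is a minimal Python sketch, not FastTree's actual implementation. It assumes a gap-free nucleotide alignment, an uncorrected mismatch-fraction distance, and 4 bytes per stored distance; all function names are hypothetical. It illustrates that, under these assumptions, the distance between two profiles equals the average pairwise distance between the sequences they summarize, which is why a profile can stand in for rows of a distance matrix. It also shows the standard Jukes-Cantor correction and the storage arithmetic for a triangular distance matrix.

```python
import math
from itertools import product

ALPHABET = "ACGT"  # a = 4 characters; gaps are ignored in this sketch


def profile(seqs):
    """Per-site character frequencies for a set of aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    return [
        [sum(s[site] == ch for s in seqs) / n for ch in ALPHABET]
        for site in range(length)
    ]


def profile_distance(p, q):
    """Expected fraction of mismatching sites between two profiles."""
    total = 0.0
    for fp, fq in zip(p, q):
        # Probability that characters drawn from each profile agree at this site.
        match = sum(x * y for x, y in zip(fp, fq))
        total += 1.0 - match
    return total / len(p)


def mismatch_fraction(s, t):
    """Uncorrected distance: fraction of sites where two sequences differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


def average_pairwise_distance(group_a, group_b):
    """Mean mismatch fraction over all pairs, one sequence from each group."""
    pairs = list(product(group_a, group_b))
    return sum(mismatch_fraction(s, t) for s, t in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a mismatch fraction p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix_bytes(n, bytes_per_entry=4):
    """Storage for the n*(n-1)/2 entries of a triangular distance matrix.

    bytes_per_entry=4 (single-precision floats) is an assumption of this sketch.
    """
    return n * (n - 1) // 2 * bytes_per_entry


# Toy data, invented for illustration only.
group_a = ["ACGTAC", "ACGTTC"]
group_b = ["ACGAAC", "TCGAAC", "ACGAAG"]

d_profile = profile_distance(profile(group_a), profile(group_b))
d_pairs = average_pairwise_distance(group_a, group_b)
print(d_profile, d_pairs, jukes_cantor(d_profile))
print(distance_matrix_bytes(158022) / 1e9, "GB")
```

With 158,022 sequences and 4-byte entries, the triangular matrix comes to roughly 49.9 GB, consistent with the 50 GB quoted above. The profiles in this sketch need only $L \times a$ numbers per node, which is where the $NLa$ term in the memory bound comes from.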