Bromberg Raquel, Grishin Nick V, Otwinowski Zbyszek
Department of Biophysics and Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America.
Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America.
PLoS Comput Biol. 2016 Jun 23;12(6):e1004985. doi: 10.1371/journal.pcbi.1004985. eCollection 2016 Jun.
Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches is the use of whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low-complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology across different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny, for instance a cluster of sulfur-reducing thermophilic bacteria coming together irrespective of their phyla. The source code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz.
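The core idea of measuring how exact substring matches decay with match length can be illustrated with a minimal sketch. This is not the SlopeTree implementation: the function names (kmer_counts, shared_matches, decay_slope) and the default length range are hypothetical, and the sketch omits every correction the abstract mentions (horizontal gene transfer, composition variation, low-complexity sequences, and multiple mutations at the same site). It assumes only that the number of exact matches of length k shared by two proteomes falls off roughly exponentially in k, so that the slope of log(match count) against k can serve as an uncalibrated divergence score.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count every substring of length k across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def shared_matches(seqs_a, seqs_b, k):
    """Number of exact length-k matches between two sets of sequences.

    Each occurrence in A is paired with each identical occurrence in B.
    """
    a = kmer_counts(seqs_a, k)
    b = kmer_counts(seqs_b, k)
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller table
    return sum(n * b[kmer] for kmer, n in a.items() if kmer in b)


def decay_slope(seqs_a, seqs_b, k_min=2, k_max=6):
    """Least-squares slope of log(match count) versus match length k.

    Returned with the sign flipped, so a larger value means matches
    vanish faster as k grows (more divergent sequences). The range
    k_min..k_max is an illustrative assumption, not a SlopeTree setting.
    """
    points = []
    for k in range(k_min, k_max + 1):
        m = shared_matches(seqs_a, seqs_b, k)
        if m > 0:  # log is undefined for zero matches; skip those lengths
            points.append((k, math.log(m)))
    if len(points) < 2:
        raise ValueError("need at least two match lengths with shared matches")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx
```

Calling decay_slope on two lists of protein sequences, one list per genome, returns a single score per genome pair. Turning such scores into a tree would additionally require a calibrated distance and a tree-building step such as neighbor joining, neither of which is shown here.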