Nasir Arshan, Kim Kyung Mo, Caetano-Anollés Gustavo
Department of Biosciences, COMSATS Institute of Information Technology, Islamabad, Pakistan.
Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, United States.
Front Microbiol. 2017 Jun 23;8:1178. doi: 10.3389/fmicb.2017.01178. eCollection 2017.
Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. 
The results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity, a tenet of evolutionary thinking.
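For readers unfamiliar with the Heaps law mentioned above: in its usual form it states that the number of distinct items V grows sublinearly with the total number of items n, V(n) = K * n^beta with 0 < beta < 1. The sketch below shows one simple way such an exponent could be estimated, by least squares on log-transformed values. It is an illustration only, not the authors' procedure: the function name `fit_heaps`, the reading of n as proteome size and V as the count of distinct domains, and the synthetic numbers are all assumptions, and the four scaling regimes reported in the abstract are not reproduced here.

```python
import math


def fit_heaps(sizes, vocab):
    """Estimate K and beta in V = K * n**beta by ordinary least squares
    on the log-log form: log V = log K + beta * log n.

    `sizes` (n) and `vocab` (V) are hypothetical inputs: here they stand
    for total domain occurrences and distinct domain counts per proteome.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line is the Heaps exponent beta.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    # Intercept gives log K.
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Synthetic, illustrative data generated from an exact power law
# (K = 5, beta = 0.6); these are not values from the study.
sizes = [10, 100, 1000, 10000]
vocab = [5.0 * n ** 0.6 for n in sizes]

K, beta = fit_heaps(sizes, vocab)
print(round(K, 3), round(beta, 3))
```

An exponent below 1 means that each additional unit of proteome size contributes fewer new domains, which is the "slowdown in innovation" pattern the abstract describes. Real data with several regimes would need a separate fit per regime rather than a single line.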