Noble Robert, Verity Kimberley
Department of Mathematics, City, University of London, London, UK.
bioRxiv. 2024 Dec 17:2023.07.17.549219. doi: 10.1101/2023.07.17.549219.
The comparison and categorization of tree diagrams is fundamental to large parts of biology, linguistics, computer science, and other fields, yet the indices currently used to describe tree shape have important flaws that complicate their interpretation and limit their scope. Here we introduce a new system of indices with no such shortcomings. Our indices account for node sizes and branch lengths and are robust to small changes in either attribute. Unlike currently popular phylogenetic diversity, phylogenetic entropy, and tree balance indices, our definitions assign interpretable values to all rooted trees and enable meaningful comparison of any pair of trees. Our self-consistent definitions further unite measures of diversity, richness, balance, symmetry, effective height, effective outdegree, and effective branch count in a coherent system, and we derive numerous simple relationships between these indices. The main practical advantages of our indices are in 1) quantifying diversity in non-ultrametric trees; 2) assessing the balance of trees that have non-uniform branch lengths or node sizes; 3) comparing the balance of trees with different leaf counts or outdegrees; and 4) obtaining a coherent, generic, multidimensional quantification of tree shape that is robust to sampling error and inferential error. We illustrate these features by comparing the shapes of trees representing the evolution of HIV and of Uralic languages, and trees generated by computational models of tumour evolution. Given the ubiquity of tree structures, we identify a wide range of applications across diverse domains.