Department of Earth Sciences, Lower Mountjoy, Durham University, Durham DH1 3LE, UK.
Bioinformatics. 2020 Dec 22;36(20):5007-5013. doi: 10.1093/bioinformatics/btaa614.
The Robinson-Foulds (RF) metric is widely used by biologists, linguists and chemists to quantify similarity between pairs of phylogenetic trees. The measure tallies the number of bipartition splits that occur in both trees, but this conservative approach ignores potential similarities between almost-identical splits, with undesirable consequences. 'Generalized' RF metrics address this shortcoming by pairing splits in one tree with similar splits in the other. Each pair is assigned a similarity score, and the sum of these scores quantifies the similarity between two trees. The challenge lies in quantifying split similarity: existing definitions lack a principled statistical underpinning, resulting in misleading tree distances that are difficult to interpret. Here, I propose probabilistic measures of split similarity, which allow tree similarity to be measured in natural units (bits).
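The split-counting idea behind the RF metric, and the way it disregards near-identical splits, can be illustrated with a minimal Python sketch. This is not the TreeDist implementation: the nested-tuple tree encoding and the helper names `leaves`, `splits` and `rf_distance` are assumptions made only for this illustration.

```python
from itertools import chain


def leaves(tree):
    """Return the set of leaf labels in a nested-tuple tree."""
    if isinstance(tree, tuple):
        return frozenset(chain.from_iterable(leaves(child) for child in tree))
    return frozenset([tree])


def splits(tree):
    """Collect the non-trivial bipartition splits of a tree.

    Each split is stored as the side that excludes a fixed reference
    leaf, so the same bipartition always has the same representation.
    """
    all_leaves = leaves(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return
        clade = leaves(node)
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count the splits that occur in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical example trees: the second moves leaf "a" to the far end.
caterpillar = ((((("a", "b"), "c"), "d"), "e"), "f")
moved = ((((("b", "c"), "d"), "e"), "f"), "a")

print(rf_distance(caterpillar, caterpillar))  # 0
print(rf_distance(caterpillar, moved))        # 6
```

In this example, relocating a single leaf leaves no split shared between the two six-leaf trees, so the distance reaches its maximum of 6 even though the trees agree on the arrangement of the other five leaves.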
My new information-theoretic metrics outperform alternative measures of tree similarity when evaluated against a broad suite of criteria, even though they do not account for the non-independence of splits within a single tree. Mutual clustering information exhibits none of the undesirable properties that characterize other tree comparison metrics, and should be preferred to the RF metric.
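As a rough illustration of how the similarity of two splits can be expressed in bits, the sketch below computes the standard mutual information between two bipartitions of the same leaf set, treating each split as a two-cluster clustering. It is an assumption-laden sketch: the function names are hypothetical, it does not reproduce the article's exact definitions or the pairing of splits across trees, and it is not the TreeDist API.

```python
from math import log2


def entropy(counts, total):
    """Shannon entropy, in bits, of a partition with the given cluster sizes."""
    return -sum(c / total * log2(c / total) for c in counts if c)


def split_mutual_information(side_a, side_b, all_leaves):
    """Mutual information (bits) between two splits of the same leaf set.

    Each split is given by one of its two sides; the other side is
    taken to be the complement within all_leaves.
    """
    all_leaves = set(all_leaves)
    n = len(all_leaves)
    a1, b1 = set(side_a), set(side_b)
    a2, b2 = all_leaves - a1, all_leaves - b1

    h_a = entropy([len(a1), len(a2)], n)
    h_b = entropy([len(b1), len(b2)], n)
    h_joint = entropy(
        [len(a1 & b1), len(a1 & b2), len(a2 & b1), len(a2 & b2)], n
    )
    return h_a + h_b - h_joint


taxa = set(range(1, 9))  # eight hypothetical leaf labels

identical = split_mutual_information({1, 2, 3, 4}, {1, 2, 3, 4}, taxa)
similar = split_mutual_information({1, 2, 3, 4}, {1, 2, 3, 5}, taxa)
unrelated = split_mutual_information({1, 2, 3, 4}, {1, 2, 5, 6}, taxa)

print(identical, similar, unrelated)
```

Under this formulation, two identical even splits of eight leaves share one bit, a pair that differs in the placement of one leaf on each side shares a smaller positive amount, and a pair whose sides overlap exactly as expected by chance shares none.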
The methods discussed in this article are implemented in the R package 'TreeDist', archived at https://dx.doi.org/10.5281/zenodo.3528123.
Supplementary data are available at Bioinformatics online.