Górecki Paweł, Markin Alexey, Vijendran Sriram, Eulenstein Oliver
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Mazowieckie, Poland.
Virus and Prion Research Unit, National Animal Disease Center, USDA-ARS, Ames, Iowa, United States of America.
PLoS Comput Biol. 2025 Jun 10;21(6):e1013069. doi: 10.1371/journal.pcbi.1013069. eCollection 2025 Jun.
The cophenetic distance is a well-established metric in biology used to compare pairs of trees represented in a vector format. This distance was introduced by Cardona and his co-authors, building on the foundational work of Sokal and Rohlf, which dates back over 60 years. It is widely recognized for its versatility since it can analyze trees with edge weights using various vector norms. However, when comparing large-scale trees, the quadratic runtime of the current best-known (i.e., naïve) algorithm for computing the cophenetic distance can become prohibitive. Recently, a new algorithmic framework with near-linear time complexity has been developed to calculate the distances of a generalized class of cophenetic distances, which are derived from the work of Sokal and Rohlf. This improvement not only allows the cophenetic distance to be utilized in large-scale studies but also enhances the versatility of these studies by incorporating generalized variants of the cophenetic distance. However, the framework is limited to applying only the L1 and L2 vector norms, which significantly restricts the versatility of generalized cophenetic distances in large-scale applications. To address this limitation, we present a near-linear time algorithmic framework for computing the generalized cophenetic distances across all Lp vector norms. In our scalability study, we showcase the practical performance of our unrestricted algorithmic framework. Furthermore, we investigate the applicability of the generalized cophenetic distances by analyzing the distributions of key components of these distances under various vector norms.
共亲距离是生物学中一种成熟的度量标准,用于比较以向量格式表示的树对。这个距离是由卡多纳及其合著者引入的,它建立在索卡尔和罗尔夫60多年前的基础工作之上。它因其通用性而被广泛认可,因为它可以使用各种向量范数来分析带有边权重的树。然而,在比较大规模的树时,当前计算共亲距离的最著名(即朴素)算法的二次运行时间可能会变得过高。最近,一种具有近线性时间复杂度的新算法框架被开发出来,用于计算一类广义共亲距离的距离,这些距离源自索卡尔和罗尔夫的工作。这一改进不仅使共亲距离能够用于大规模研究,还通过纳入共亲距离的广义变体增强了这些研究的通用性。然而,该框架仅限于应用L1和L2向量范数,这在很大程度上限制了广义共亲距离在大规模应用中的通用性。为了解决这一限制,我们提出了一种近线性时间算法框架,用于计算所有Lp向量范数下的广义共亲距离。在我们的可扩展性研究中,我们展示了我们无限制算法框架的实际性能。此外,我们通过分析这些距离在各种向量范数下关键组成部分的分布,研究了广义共亲距离的适用性。