Schmidt Markus, Kutzner Arne, Heese Klaus
Department of Computer Science, Friedrich-Alexander University, Martensstr. 3, Erlangen-Nürnberg, Germany.
Department of Information Systems, College of Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 133-791, Republic of Korea.
J Theor Biol. 2017 Aug 1;427:1-7. doi: 10.1016/j.jtbi.2017.05.008. Epub 2017 May 15.
Similarities among ortholog genes for a given set of species S can be expressed by alignment matrices, where each matrix cell results from aligning a gene transcript against the genome of a species within S. Gene clusters can be computed by using single-linkage clustering in time n × m, where n denotes the number of ortholog genes and m denotes the number of inspected assemblies. Our approach can break the O(n × m) complexity of single-linkage clustering by exploiting an order among species that results from an in-order traversal of a given phylogenetic tree. The order among species allows the reduction of the inspected scope of the matrix to taxonomically related combinations of assemblies and genes, thus lowering the computational efforts necessary for creating the alignment matrix without affecting cluster quality. We present two novel approaches for clustering. First, we introduce a hierarchical clustering with, omitting the initial sorting of |S| elements, amortized O(|S|) time behavior, where it holds |S|≤n+m. Then, we propose a consecutive clustering having a linear time complexity O(|S|). Both approaches compute identical clusters, whereas dendrograms can only be obtained from the hierarchical one. We prove that our approaches deliver higher cluster densities than single linkage clustering. Additionally, we show that we compute clusters of superior quality, which ensures that our approaches are generally less error prone.
给定物种集S的直系同源基因之间的相似性可以通过比对矩阵来表示,其中每个矩阵单元是通过将一个基因转录本与S内一个物种的基因组进行比对得到的。基因簇可以通过使用单链聚类在时间n×m内计算得出,其中n表示直系同源基因的数量,m表示检查的组装体数量。我们的方法可以通过利用给定系统发育树的中序遍历所产生的物种顺序来打破单链聚类的O(n×m)复杂度。物种顺序允许将矩阵的检查范围缩小到分类学上相关的组装体和基因组合,从而在不影响聚类质量的情况下降低创建比对矩阵所需的计算量。我们提出了两种新颖的聚类方法。首先,我们引入一种层次聚类,省略|S|个元素的初始排序,具有分摊的O(|S|)时间复杂度,其中|S|≤n+m。然后,我们提出一种具有线性时间复杂度O(|S|)的连续聚类。两种方法计算出的聚类相同,而树状图只能从层次聚类中获得。我们证明我们的方法比单链聚类具有更高的聚类密度。此外,我们表明我们计算出的聚类质量更高,这确保了我们的方法通常更不易出错。