Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada.
Center of Artificial Intelligence, Astrakhan State University, Astrakhan 414056, Russia.
J Bioinform Comput Biol. 2022 Aug;20(4):2250012. doi: 10.1142/S0219720022500123. Epub 2022 Jul 6.
The evolutionary histories of genes can differ greatly from one another, which can be explained by evolutionary events such as horizontal gene transfer or recombination. A phylogenetic tree therefore represents the evolutionary history of each gene, and it may show patterns that differ from those of the species tree, which defines the main evolutionary patterns. In addition, phylogenetic trees of closely related species should be merged to minimize the topological conflicts between them, yielding consensus trees (for homogeneous data) or supertrees (for heterogeneous data). The traditional approaches are consensus tree inference (when all trees are defined on the same set of species) and supertree inference (when the trees are defined on different but overlapping sets of species). Both are designed to produce a single tree, and so they lose precision when the input trees reflect different evolutionary histories. Other approaches have been implemented to preserve this variability using the k-means algorithm or the k-medoids algorithm. Using a new method, we determine all consensus trees and supertrees that best represent the most significant evolutionary models in a set of phylogenetic trees, thereby increasing the precision of the results and decreasing the time required. This paper presents in detail a new method for predicting the number of clusters in a Robinson and Foulds (RF) distance matrix using a convolutional neural network (CNN). We developed a new CNN approach (called CNNTrees) for multiple tree classification. The new strategy returns the number of clusters of the input phylogenetic trees for sets of trees of different sizes, which makes the approach more stable and more robust.
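To make the input of the method concrete, the following minimal sketch builds an RF distance matrix for a set of trees. It is an illustration only and not the CNNTrees code: it assumes that every tree is a nested tuple of leaf names, that all trees share the same set of taxa (the consensus case), and that the unnormalized RF distance is wanted. The function names `leaf_set`, `bipartitions` and `rf_matrix` are hypothetical.

```python
from itertools import combinations


def leaf_set(tree):
    """Return the frozenset of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial splits of a tree, treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so equal splits compare equal across trees.
    """
    taxa = leaf_set(tree)
    ref = min(taxa)  # arbitrary but deterministic reference taxon
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if ref in below else below
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(taxa) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_matrix(trees):
    """Return the symmetric matrix of pairwise RF distances."""
    taxa = leaf_set(trees[0])
    if any(leaf_set(t) != taxa for t in trees):
        raise ValueError("this sketch assumes identical taxon sets")
    splits = [bipartitions(t) for t in trees]
    n = len(trees)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # RF distance = number of splits present in exactly one tree.
        matrix[i][j] = matrix[j][i] = len(splits[i] ^ splits[j])
    return matrix


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    t3 = ((("A", "D"), "B"), ("C", "E"))
    for row in rf_matrix([t1, t2, t3]):
        print(row)
```

Trees defined on different but overlapping taxon sets (the supertree case) need an adapted distance, which this sketch does not attempt.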
The paper provides an in-depth analysis of the relevant, but very difficult, problem of constructing alternative supertrees from phylogenies with different but overlapping sets of taxa. This new model will play an important role in the inference of Trees of Life (ToL). CNNTrees is available through a web server at https://tahirinadia.github.io/. The source code, data and information about installation procedures are also available at https://github.com/TahiriNadia/CNNTrees. Supplementary data are available on the GitHub platform.

The evolutionary history of species is not unique, but is specific to sets of genes. Indeed, each gene has its own evolutionary history, which can differ considerably from that of other genes. For example, some individual genes or operons may be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene must be represented by its own phylogenetic tree, which may exhibit evolutionary patterns different from those of the species tree that accounts for the major vertical descent patterns. Traditional consensus tree and supertree inference methods return a single consensus tree or supertree. In this paper, we present in detail a new method for predicting the number of clusters in a Robinson and Foulds (RF) distance matrix using a convolutional neural network (CNN). We developed a new CNN approach (CNNTrees) for multiple tree classification. This new strategy returns a number of clusters in the order of the input trees, which makes the approach more stable and more robust.
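Once a number of clusters is known, the trees can be grouped directly from the distance matrix. The sketch below is a plain k-medoids pass over a precomputed matrix, in the spirit of the earlier clustering approaches mentioned in the abstract. It is not the CNNTrees procedure: here `k` is simply passed in as a parameter, whereas the paper predicts it with a CNN. The function names and the deterministic farthest-point initialization are assumptions made for illustration.

```python
def assign(dist, medoids):
    """Label every item with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100):
    """Cluster items from a symmetric distance matrix into k groups.

    Returns (labels, medoids), where labels[i] is the cluster index of
    item i and medoids[c] is the item index representing cluster c.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")

    # Assumed initialization: start from the most central item, then
    # repeatedly add the item farthest from the medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(
            max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        )

    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid for an empty cluster
                updated.append(medoids[c])
                continue
            # New medoid: member with the smallest total distance
            # to the other members of its cluster.
            updated.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if updated == medoids:
            break
        medoids = updated

    return assign(dist, medoids), medoids


if __name__ == "__main__":
    # Toy RF-like matrix with two well-separated groups of trees.
    toy = [
        [0, 2, 8, 8],
        [2, 0, 8, 8],
        [8, 8, 0, 2],
        [8, 8, 2, 0],
    ]
    print(k_medoids(toy, 2))
```

The medoid tree of each cluster can serve as a rough representative of that group. A consensus tree or supertree per cluster, as the paper describes, would be computed from the cluster members in a separate step.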