The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.
School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.
Bioinformatics. 2022 Jun 24;38(Suppl 1):i118-i124. doi: 10.1093/bioinformatics/btac252.
In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree.
Here, we propose an artificial-intelligence-based approach that provides a means to select the optimal subset of sites, together with a formula for computing the log-likelihood of the entire data from this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while constraining the number of sites used for the approximation. We show that computing the likelihood from only 5% of the sites already provides an accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search substantially decreases running time while retaining the same tree-search performance.
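The idea of predicting the full log-likelihood from a sparse, weighted subset of sites can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it is a small pure-Python coordinate-descent Lasso written under several assumptions: the training matrix holds per-site log-likelihoods with one row per tree and one column per site, the target is the row sum, and the synthetic data, the penalty `alpha=0.5`, and the function names `soft_threshold`, `fit_lasso`, `predict_total`, and `demo` are all hypothetical choices made for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha, n_sweeps=200):
    """Coordinate descent for (1/2n)*||y - b - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre columns and target so the intercept can be recovered afterwards.
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]
    weights = [0.0] * p
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            # Correlation of column j with the residual, excluding its own term.
            rho = sum(Xc[i][j] * residual[i] for i in range(n))
            rho += weights[j] * col_sq[j]
            new_w = soft_threshold(rho, alpha * n) / col_sq[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                weights[j] = new_w
    intercept = y_mean - sum(w * m for w, m in zip(weights, col_means))
    return weights, intercept


def predict_total(site_lls, weights, intercept):
    """Approximate the full log-likelihood from the selected sites only."""
    return intercept + sum(
        w * ll for w, ll in zip(weights, site_lls) if w != 0.0
    )


def demo():
    # Synthetic stand-in for real data: 60 trees x 40 sites, where sites
    # share three latent signals, so many columns are nearly redundant.
    rng = random.Random(0)
    n_trees, n_sites = 60, 40
    X = []
    for _ in range(n_trees):
        factors = [rng.gauss(0, 1) for _ in range(3)]
        X.append([-5.0 + factors[j % 3] + rng.gauss(0, 0.05)
                  for j in range(n_sites)])
    y = [sum(row) for row in X]  # full-alignment log-likelihood per tree
    weights, intercept = fit_lasso(X, y, alpha=0.5)  # alpha is an assumption
    selected = [j for j, w in enumerate(weights) if w != 0.0]
    worst = max(abs(predict_total(r, weights, intercept) - t)
                for r, t in zip(X, y))
    print(f"{len(selected)} of {n_sites} sites kept; "
          f"worst training error {worst:.3f}")


if __name__ == "__main__":
    demo()
```

In this sketch the sites with non-zero weights play the role of the selected subset, and `predict_total` plays the role of the approximation formula: an intercept plus a weighted sum of the log-likelihoods of the selected sites. Raising `alpha` keeps fewer sites at the cost of a coarser approximation; in practice an established implementation such as scikit-learn's `Lasso` would likely be preferable to a hand-written solver.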
The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3.
Supplementary data are available at Bioinformatics online.