Accurate Inference of Tree Topologies from Multiple Sequence Alignments Using Deep Learning.

Affiliations

Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA.

Biological and Biomedical Sciences Program, University of North Carolina at Chapel Hill, 130 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA.

Publication information

Syst Biol. 2020 Mar 1;69(2):221-233. doi: 10.1093/sysbio/syz060.

DOI:10.1093/sysbio/syz060

PMID:31504938

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8204903/

Abstract

Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with its own strengths and weaknesses. Both simulation and empirical studies have identified several "zones" of parameter space where the accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e., to assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.
