PhyloTune: An efficient method to accelerate phylogenetic updates using a pretrained DNA language model.

Authors

Deng Danruo, Xu Wuqin, Wu Bian, Comes Hans Peter, Feng Yu, Li Pan, Zheng Jinfang, Chen Guangyong, Heng Pheng-Ann

Affiliations

Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China.

Zhejiang Lab, Kechuang Avenue, Hangzhou, China.

Publication information

Nat Commun. 2025 Jul 26;16(1):6905. doi: 10.1038/s41467-025-61684-3.

DOI:10.1038/s41467-025-61684-3

PMID:40715068

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12297363/

Abstract

Understanding the phylogenetic relationships among species is crucial for comprehending major evolutionary transitions. As the volume of sequence data keeps growing, constructing reliable phylogenetic trees efficiently is becoming more challenging for current analytical methods. In this study, we introduce a new solution that uses a pretrained DNA language model to accelerate the integration of novel taxa into an existing phylogenetic tree. Our approach identifies the taxonomic unit of a newly collected sequence using existing taxonomic classification systems and updates the corresponding subtree. Specifically, we leverage a pretrained BERT network to obtain high-dimensional sequence representations, which are used not only to determine the subtree to be updated but also to identify potentially valuable regions for subtree construction. We demonstrate the effectiveness of our method, named PhyloTune, through experiments on simulated datasets as well as on our curated plant (focusing on Embryophyta) and microbial (focusing on the genus Bordetella) datasets. Our findings provide evidence that phylogenetic trees can be constructed by automatically selecting the most informative regions of sequences, without manual selection of molecular markers. This discovery offers a guide for further research into the functional aspects of different regions of DNA sequences, enriching our understanding of biology.
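
The two ideas in the abstract, placing a new sequence into a taxonomic unit so that only that subtree is rebuilt, and keeping only the most informative sequence regions, can be illustrated with a minimal Python sketch. This is not the PhyloTune implementation. As a loudly stated assumption, normalized k-mer frequency vectors stand in for the BERT sequence representations, a nearest-centroid rule stands in for the taxonomic classifier, and the per-position scores passed to `top_regions` are hypothetical inputs. All function names (`embed`, `assign_taxon`, `top_regions`) are invented for illustration.

```python
from collections import Counter
from itertools import product
import math

K = 3  # k-mer length for the stand-in embedding (assumption)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def embed(seq):
    """Stand-in embedding: normalized k-mer frequencies (NOT a BERT representation)."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1
    return [counts[k] / total for k in KMERS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_taxon(new_seq, reference):
    """Return the taxon whose centroid is most similar to the new sequence.

    `reference` maps a taxon name to its member sequences; the chosen taxon
    names the subtree that would be rebuilt (hypothetical nearest-centroid rule).
    """
    query = embed(new_seq)
    centroids = {t: centroid([embed(s) for s in seqs]) for t, seqs in reference.items()}
    return max(centroids, key=lambda t: cosine(query, centroids[t]))


def top_regions(scores, n_regions, keep):
    """Split per-position scores into equal windows; return the `keep` best window indices."""
    size = len(scores) // n_regions
    if size == 0:
        raise ValueError("need at least one score per region")
    means = [sum(scores[i * size:(i + 1) * size]) / size for i in range(n_regions)]
    ranked = sorted(range(n_regions), key=lambda i: means[i], reverse=True)
    return sorted(ranked[:keep])
```

In this sketch, calling `assign_taxon(new_seq, reference)` returns the name of the taxon whose subtree would be updated, and `top_regions(scores, n_regions, keep)` returns the indices of the windows that would be passed to alignment and tree inference for that subtree. A real pipeline would replace `embed` with representations from the pretrained model and obtain the region scores from that model as well.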

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b853/12297363/388453848235/41467_2025_61684_Fig1_HTML.jpg
