Deng Danruo, Xu Wuqin, Wu Bian, Comes Hans Peter, Feng Yu, Li Pan, Zheng Jinfang, Chen Guangyong, Heng Pheng-Ann
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China.
Zhejiang Lab, Kechuang Avenue, Hangzhou, China.
Nat Commun. 2025 Jul 26;16(1):6905. doi: 10.1038/s41467-025-61684-3.
Understanding the phylogenetic relationships among species is crucial for comprehending major evolutionary transitions. Despite the ever-growing volume of sequence data, effectively constructing reliable phylogenetic trees is becoming more challenging for current analytical methods. In this study, we introduce a new solution that accelerates the integration of novel taxa into an existing phylogenetic tree using a pretrained DNA language model. Our approach identifies the taxonomic unit of a newly collected sequence using existing taxonomic classification systems and updates the corresponding subtree. Specifically, we leverage a pretrained BERT network to obtain high-dimensional sequence representations, which are used not only to determine the subtree to be updated but also to identify potentially valuable regions for subtree construction. We demonstrate the effectiveness of our method, named PhyloTune, through experiments on simulated datasets as well as on our curated plant (focusing on Embryophyta) and microbial (focusing on the genus Bordetella) datasets. Our findings provide evidence that phylogenetic trees can be constructed by automatically selecting the most informative regions of sequences, without manual selection of molecular markers. This finding offers a guide for further research into the functional aspects of different regions of DNA sequences, enriching our understanding of biology.
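The two steps described in the abstract (placing a new sequence in a taxonomic unit, then picking informative regions for rebuilding that subtree) can be illustrated with a small toy sketch. This is not the PhyloTune implementation. Everything here is an assumption made for illustration: k-mer frequency vectors stand in for the pretrained BERT representations, a nearest-centroid rule stands in for the taxonomic classifier, and per-window column variability stands in for whatever region scoring the method actually uses. The function names (`embed`, `assign_taxon`, `top_regions`) are hypothetical, and `top_regions` assumes the sequences are already aligned.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 3  # k-mer size; an arbitrary choice for this toy example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def embed(seq):
    """Toy stand-in for a language-model embedding: normalised k-mer frequencies."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1
    return [counts[k] / total for k in KMERS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_taxon(query, taxa):
    """Return the name of the taxon whose mean embedding is most similar to the query.

    `taxa` maps a taxon name to the sequences already placed in that subtree.
    """
    q = embed(query)
    centroids = {name: mean_vector([embed(s) for s in seqs])
                 for name, seqs in taxa.items()}
    return max(centroids, key=lambda name: cosine(q, centroids[name]))


def top_regions(seqs, window, n_regions):
    """Rank non-overlapping windows of aligned sequences by number of variable columns.

    Returns up to `n_regions` (start, end) pairs, most variable first.
    """
    length = min(len(s) for s in seqs)
    scored = []
    for start in range(0, length - window + 1, window):
        # Count columns in this window that contain more than one distinct base.
        variable = sum(len({s[i] for s in seqs}) > 1
                       for i in range(start, start + window))
        scored.append((variable, start))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(start, start + window) for _, start in scored[:n_regions]]
```

In this sketch, a new sequence would first be passed to `assign_taxon` to choose the subtree to update, and the members of that subtree would then be passed to `top_regions` to select the windows used for tree reconstruction with any standard tree-building tool.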