Duan Chenrui, Zang Zelin, Xu Yongjie, He Hang, Li Siyuan, Liu Zihan, Lei Zhen, Zheng Ju-Sheng, Li Stan Z
College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China.
School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf149.
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils and significantly affect human health and ecological functions. However, current research relies on k-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet-enhanced metagenomic contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels (gene, functional, bacterial, and environmental), ranging from 1 to 213k input sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
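The three ideas named in the abstract (k-mer tokenization as the baseline, masked modeling over a token sequence, and a triplet contrastive objective) can be illustrated with a minimal, dependency-free sketch. This is not the FGeneBERT implementation: the function names, the 15% mask rate, the `[MASK]` token, the Euclidean distance, and the margin value are all assumptions chosen for illustration, and the real model operates on protein-based gene embeddings rather than raw strings.

```python
import math
import random


def kmer_tokens(sequence, k=3):
    """Split a nucleotide string into overlapping k-mers (the baseline tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; return the masked list and a {position: original} map.

    The mask rate and mask token are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # a model would be trained to recover this token
        else:
            masked.append(tok)
    return masked, targets


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on plain float vectors (Euclidean distance).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    tokens = kmer_tokens("ATGCGT", k=3)
    print(tokens)  # ['ATG', 'TGC', 'GCG', 'CGT']
    print(mask_tokens(tokens, mask_rate=0.5))
    print(triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
```

In a gene-level setting, each token would stand for a whole gene rather than a k-mer, and the anchor, positive, and negative vectors would be embeddings of a gene, a gene with the same function, and a gene with a different function, respectively.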