Lai Qilong, Yao Shuai, Zha Yuguo, Zhang Haohong, Zhang Haobo, Ye Ying, Zhang Yonghui, Bai Hong, Ning Kang
MOE Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.
Hubei Key Laboratory of Natural Medicinal Chemistry and Resource Evaluation, School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei, China.
Nucleic Acids Res. 2025 Apr 10;53(7). doi: 10.1093/nar/gkaf305.
Biosynthetic gene clusters (BGCs), which are key to the synthesis of microbial secondary metabolites, remain mostly hidden in microbial genomes and metagenomes. To unearth this vast potential, we present BGC-Prophet, a transformer-based language model for BGC prediction and classification. Leveraging the transformer encoder, BGC-Prophet captures location-dependent relationships between genes. As one of the pioneering ultrahigh-throughput tools, BGC-Prophet significantly surpasses existing methods in efficiency and fidelity, enabling comprehensive pan-phylogenetic and whole-metagenome BGC screening. Through the analysis of 85 203 genomes and 9428 metagenomes, BGC-Prophet has profiled an extensive array of sub-million BGCs. It highlights notable enrichment in phyla such as Actinomycetota and the widespread distribution of polyketide, nonribosomal peptide (NRP), and ribosomally synthesized and post-translationally modified peptide (RiPP) BGCs across diverse lineages. It also reveals enrichment patterns of BGCs following important geological events, suggesting environmental influences on BGC evolution. BGC-Prophet's capabilities in detecting BGCs and their evolutionary patterns contribute to a deeper understanding of microbial secondary metabolites and to their application in synthetic biology.