Deep self-supervised learning for biosynthetic gene cluster detection and product classification.

Affiliations

Microsoft Research New England, Cambridge, Massachusetts, United States of America.

Department of Bioengineering, Stanford University, Stanford, California, United States of America.

Publication information

PLoS Comput Biol. 2023 May 23;19(5):e1011162. doi: 10.1371/journal.pcbi.1011162. eCollection 2023 May.

DOI:10.1371/journal.pcbi.1011162
PMID:37220151
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10241353/
Abstract

Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.
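
The masking step behind "train a masked language model on these domains" can be illustrated with a minimal sketch. This is not the authors' implementation: the Pfam-style domain identifiers, the `mask_domains` helper, the `[MASK]` token, and the masking rate are all assumptions chosen for illustration. The sketch only shows how a BGC written as an ordered chain of domain tokens could be turned into a masked input plus the targets a model would be trained to recover.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_domains(domains, mask_rate=0.15, rng=None):
    """Return (masked_sequence, targets) for one BGC given as a list of domain IDs.

    targets maps each masked position to the original domain at that position,
    i.e. the label a masked language model would be asked to predict.
    """
    rng = rng or random.Random()
    if not domains:
        return [], {}
    # Mask at least one position so every sequence yields a training signal.
    n_mask = max(1, round(len(domains) * mask_rate))
    positions = sorted(rng.sample(range(len(domains)), n_mask))
    masked = list(domains)
    targets = {}
    for pos in positions:
        targets[pos] = domains[pos]
        masked[pos] = MASK
    return masked, targets


# Illustrative BGC as an ordered chain of Pfam-style domain identifiers
# (an assumed example, not data from the paper).
bgc = ["PF00109", "PF02801", "PF00698", "PF08659", "PF00550", "PF00975"]
masked, targets = mask_domains(bgc, mask_rate=0.3, rng=random.Random(0))
print(masked)
print(targets)
```

In a real training loop the masked sequence would be fed to the network and the loss computed only at the masked positions; the abstract does not specify the model architecture, so that part is left out here.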

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c31/10241353/61c31174e8cb/pcbi.1011162.g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c31/10241353/8a2094139627/pcbi.1011162.g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c31/10241353/cd1a5d9f5b03/pcbi.1011162.g003.jpg
