Interpretable adenylation domain specificity prediction using protein language models.

Author information

Adduri Abhinav K, McNutt Andrew T, Ellington Caleb N, Suraparaju Krish, Fang Nan, Yan Donghui, Krummenacher Benjamin, Li Sitong, Bodden Camilla, Xing Eric P, Behsaz Bahar, Koes David, Mohimani Hosein

Affiliations

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.

Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA.

Publication information

bioRxiv. 2025 Jan 18:2025.01.13.632878. doi: 10.1101/2025.01.13.632878.

Abstract

Natural products have long been a rich source of diverse and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, and often results in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy for screening hundreds of thousands of microbial genomes for their potential to produce novel natural products. Adenylation (A) domains play a key role in the biosynthesis of NRPs and NRP-PKs by recruiting the substrates that are incrementally assembled into the final structure. We propose MASPR, a machine learning method that leverages protein language models for accurate and interpretable predictions of A-domain substrate specificities. MASPR demonstrates superior accuracy and generalization over existing methods and can predict substrates not present in its training data, a capability known as zero-shot classification. We use MASPR to develop Seq2Hybrid, an efficient algorithm to predict the structure of hybrid NRP-PK natural products from microbial genomes. Using Seq2Hybrid, we propose putative biosynthetic gene clusters for the orphan natural products Octaminomycin A, Dityromycin, SW-163B, and JBIR-39.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fc1/11761653/0884b1071ab1/nihpp-2025.01.13.632878v1-f0001.jpg
