SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings.

Affiliations

Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands.

Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Publication information

Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae328.

DOI:10.1093/bioinformatics/btae328
PMID:38775729
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11147799/
Abstract

MOTIVATION

Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications to bacterial genomes.

RESULTS

To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, some species of which are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.

AVAILABILITY AND IMPLEMENTATION

https://github.com/AbeelLab/safpred.
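The abstract does not spell out the algorithm, so the following is only a generic sketch of how embedding-based, synteny-aware annotation transfer could look, not the SAFPred implementation. It assumes each gene already has a fixed-length embedding from some protein language model, averages a gene's embedding with those of its immediate neighbours on the contig as a crude stand-in for genomic context, and transfers labels from the most similar reference vectors. All function names, the window size, the toy vectors, and the labels are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_vector(embeddings, index, window=1):
    """Average the embedding of a gene with those of its neighbours.

    `embeddings` is the list of per-gene vectors in contig order; the
    window is an illustrative assumption, not a value from the paper.
    """
    lo = max(0, index - window)
    hi = min(len(embeddings), index + window + 1)
    block = embeddings[lo:hi]
    dim = len(block[0])
    return [sum(vec[d] for vec in block) / len(block) for d in range(dim)]


def transfer_labels(query, references, k=1):
    """Score each label by its best similarity among the k nearest references."""
    ranked = sorted(references, key=lambda ref: cosine(query, ref[0]), reverse=True)
    scores = defaultdict(float)
    for vec, labels in ranked[:k]:
        sim = max(cosine(query, vec), 0.0)
        for label in labels:
            scores[label] = max(scores[label], sim)
    return dict(scores)


# Toy data: 3-dimensional stand-ins for real protein embeddings.
references = [
    ([1.0, 0.0, 0.0], {"toxin activity"}),
    ([0.0, 1.0, 0.0], {"transport"}),
]
contig = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]

query = context_vector(contig, index=0, window=1)
predictions = transfer_labels(query, references, k=1)
print(predictions)
```

In this toy run the first gene and its right-hand neighbour both lie close to the "toxin activity" reference, so that label is the only one transferred. A real pipeline would use embeddings with hundreds of dimensions and a curated reference database.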

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f0c6/11147799/5eb2f91a4caf/btae328f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f0c6/11147799/92f60aed338b/btae328f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f0c6/11147799/f42f544ffa58/btae328f3.jpg
