Clustering FunFams using sequence embeddings improves EC purity.

Authors

Littmann Maria, Bordin Nicola, Heinzinger Michael, Schütze Konstantin, Dallago Christian, Orengo Christine, Rost Burkhard

Affiliations

Department of Informatics, Bioinformatics & Computational Biology-i12, TUM (Technical University of Munich), 85748 Garching/Munich, Germany.

TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), 85748 Garching/Munich, Germany.

Publication information

Bioinformatics. 2021 Oct 25;37(20):3449-3455. doi: 10.1093/bioinformatics/btab371.

DOI: 10.1093/bioinformatics/btab371
PMID: 33978744
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8545299/
Abstract

MOTIVATION

Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations.
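
The notion of a 'pure' family and the shares quoted above can be illustrated with a minimal sketch. The helper `is_pure`, the family names, and the toy EC numbers are hypothetical and serve only as an illustration; treating unannotated members as ignorable is an assumption of this sketch, not something the abstract specifies. Only the four counts come from the text.

```python
def is_pure(ec_annotations):
    """Return True if all annotated members share a single EC number.

    Members without an EC annotation (None) are ignored in this sketch.
    """
    return len({ec for ec in ec_annotations if ec is not None}) <= 1


# Hypothetical toy families; the EC numbers are illustrative only.
families = {
    "family_a": ["3.4.21.4", "3.4.21.4", None],
    "family_b": ["3.4.21.4", "2.7.11.1"],
}
impure = [name for name, ecs in families.items() if not is_pure(ecs)]

# Shares reported in the abstract, recomputed from the stated counts.
share_with_ec = 22830 / 203639       # FunFams with EC annotations
share_inconsistent = 1526 / 22830    # of those, FunFams with inconsistent ECs
print(impure, f"{share_with_ec:.1%}", f"{share_inconsistent:.1%}")
```

Recomputing the shares from the counts gives values that round to the 11% and 7% stated in the abstract.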

RESULTS

We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help to generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
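
The clustering step described above, DBSCAN over distances between embeddings with unclustered points treated as outliers, can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation (their code is in the GitHub repository named below). The 2-D toy vectors stand in for real per-protein embeddings, and the Euclidean metric and the `eps` and `min_samples` values are assumptions chosen for the toy data. In practice a library implementation such as scikit-learn's `DBSCAN` would normally be used.

```python
import math
from collections import deque

NOISE = -1  # label for outliers, following the common DBSCAN convention


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels


# Toy stand-ins for the embeddings of one family (illustrative values only).
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # first dense group
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # second dense group
    (10.0, 0.0),                          # isolated point
]
labels = dbscan(embeddings, eps=0.5, min_samples=3)
print(labels)
```

On this toy input the two dense groups become separate clusters and the isolated point is labeled as an outlier, which mirrors how a family could be split into sub-families while inconsistent members are flagged.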

AVAILABILITY AND IMPLEMENTATION

Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4be4/8545299/bf9ca04da9f9/btab371f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4be4/8545299/725f0e99725a/btab371f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4be4/8545299/f61b13de8c82/btab371f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4be4/8545299/b11002f7b629/btab371f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4be4/8545299/9f339bc41975/btab371f5.jpg
