Clustering FunFams using sequence embeddings improves EC purity.

Author information

Littmann Maria, Bordin Nicola, Heinzinger Michael, Schütze Konstantin, Dallago Christian, Orengo Christine, Rost Burkhard

Affiliations

Department of Informatics, Bioinformatics & Computational Biology-i12, TUM (Technical University of Munich), 85748 Garching/Munich, Germany.

Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), TUM Graduate School, 85748 Garching/Munich, Germany.

Publication information

Bioinformatics. 2021 Oct 25;37(20):3449-3455. doi: 10.1093/bioinformatics/btab371.

DOI:10.1093/bioinformatics/btab371

PMID:33978744

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8545299/

Abstract

MOTIVATION

Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations.
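The purity criterion and the two percentages quoted above can be checked with a short sketch. It treats a family as pure when all of its annotated members carry exactly the same EC number string; that exact-match rule and the helper names `is_pure` and `percent` are assumptions made for illustration, not the authors' code.

```python
def is_pure(ec_annotations):
    """Assumed criterion: a family is pure if its members share one EC number."""
    return len(set(ec_annotations)) <= 1


def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


# Counts taken from the abstract.
with_ec = percent(22830, 203639)      # FunFams that carry EC annotations
inconsistent = percent(1526, 22830)   # of those, FunFams with mixed annotations

print(f"FunFams with EC annotations: {with_ec:.1f}%")
print(f"Of those, inconsistent:      {inconsistent:.1f}%")

# Hypothetical example families.
print(is_pure(["3.4.21.4", "3.4.21.4"]))   # same EC number throughout
print(is_pure(["3.4.21.4", "2.7.11.1"]))   # two different EC numbers
```

Rounded to whole numbers, the two shares match the 11% and 7% reported in the abstract.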

RESULTS

We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models that transfer knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect increased purity for other aspects of function as well. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
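The clustering step described above can be illustrated with a minimal, dependency-free sketch of DBSCAN applied to embedding vectors, followed by a count of pure clusters. This is not the authors' implementation (their code is in the GitHub repository listed under Availability): the Euclidean distance, the `eps` and `min_samples` values, the two-dimensional toy vectors standing in for real embeddings, and the EC labels are all assumptions chosen for illustration.

```python
import math

NOISE = -1  # label for outliers, following the usual DBSCAN convention


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # A point counts as its own neighbour, as in the standard formulation.
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, keep expanding
    return labels


def pure_clusters(labels, ec_numbers):
    """Count clusters whose members all share one EC number (outliers ignored)."""
    by_cluster = {}
    for label, ec in zip(labels, ec_numbers):
        if label != NOISE:
            by_cluster.setdefault(label, set()).add(ec)
    return sum(1 for ecs in by_cluster.values() if len(ecs) == 1)


# Toy "family": 2-D stand-ins for embeddings, with hypothetical EC labels.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                  (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
                  (10.0, 0.0)]
toy_ec = ["1.1.1.1"] * 3 + ["2.7.11.1"] * 3 + ["3.4.21.4"]

labels = dbscan(toy_embeddings, eps=0.5, min_samples=3)
print(labels)
print(pure_clusters(labels, toy_ec))
```

In this toy family the members carry three different EC numbers, so the family as a whole is impure, while the two clusters found by DBSCAN are each pure and the isolated point is flagged as an outlier. For real embeddings, a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN` would be the more practical choice; the pure-Python version here only keeps the example self-contained.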

AVAILABILITY AND IMPLEMENTATION

Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
