Clustering FunFams using sequence embeddings improves EC purity.

Author information

Littmann Maria, Bordin Nicola, Heinzinger Michael, Schütze Konstantin, Dallago Christian, Orengo Christine, Rost Burkhard

Affiliations

Department of Informatics, Bioinformatics & Computational Biology-i12, TUM (Technical University of Munich), 85748 Garching/Munich, Germany.

Center for Doctoral Studies in Informatics and its Applications (CeDoSIA), TUM Graduate School, 85748 Garching/Munich, Germany.

Publication information

Bioinformatics. 2021 Oct 25;37(20):3449-3455. doi: 10.1093/bioinformatics/btab371.

Abstract

MOTIVATION

Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations.
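The percentages and the notion of purity above can be checked with a short sketch. The counts come from the text; the function `is_pure` and its treatment of unannotated proteins (they are ignored) are assumptions made for illustration, not the authors' definition.

```python
# Counts quoted in the abstract.
total_funfams = 203_639
with_ec = 22_830
impure = 1_526

share_with_ec = 100 * with_ec / total_funfams   # share of FunFams with EC annotations
share_impure = 100 * impure / with_ec           # share of those with inconsistent annotations


def is_pure(ec_sets):
    """Return True if all annotated proteins carry the same set of EC numbers.

    ec_sets: one collection of EC numbers per protein; an empty collection
    means "no annotation" and is ignored (an assumption of this sketch).
    """
    annotated = {frozenset(s) for s in ec_sets if s}
    return len(annotated) <= 1


# Hypothetical toy families, not real FunFams.
pure_family = [{"1.1.1.1"}, {"1.1.1.1"}, set()]
impure_family = [{"1.1.1.1"}, {"2.7.1.1"}]
```

Rounded to whole numbers, the two shares reproduce the 11% and 7% stated in the text.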

RESULTS

We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models that transfer knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Clustering FunFams and identifying outliers with DBSCAN on distances between embeddings doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we also expect increased purity for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
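The clustering step can be illustrated with a minimal, dependency-free DBSCAN sketch. This is not the authors' code (see the repository linked below for that). The Euclidean distance, the values of `eps` and `min_pts`, the two-dimensional toy "embeddings", and the EC labels are all assumptions chosen for illustration; real embeddings have far more dimensions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, p in enumerate(points) if euclidean(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep growing
    return labels


# Hypothetical family: (toy embedding, EC number) per protein.
family = [
    ([0.0, 0.0], "1.1.1.1"), ([0.1, 0.0], "1.1.1.1"), ([0.0, 0.1], "1.1.1.1"),
    ([5.0, 5.0], "2.7.1.1"), ([5.1, 5.0], "2.7.1.1"), ([5.0, 5.1], "2.7.1.1"),
    ([20.0, 20.0], "3.4.21.4"),
]
labels = dbscan([vec for vec, _ in family], eps=0.5, min_pts=3)

# Collect the EC numbers seen in each cluster; outliers (-1) are set aside.
ecs_per_cluster = {}
for label, (_, ec) in zip(labels, family):
    if label != -1:
        ecs_per_cluster.setdefault(label, set()).add(ec)
pure_clusters = [c for c, ecs in ecs_per_cluster.items() if len(ecs) == 1]
```

In this toy case the family as a whole mixes three EC numbers, while each of the two resulting clusters contains a single EC number and the isolated protein is flagged as an outlier.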

AVAILABILITY AND IMPLEMENTATION

Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
