Clustering protein functional families at large scale with hierarchical approaches.

Affiliations

Institute of Structural and Molecular Biology, University College London, London, UK.

Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain.

Publication information

Protein Sci. 2024 Sep;33(9):e5140. doi: 10.1002/pro.5140.

Abstract

Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
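
The idea of forming a relationship tree from a distance matrix can be illustrated with a minimal sketch. This is not the CATH-eMMA implementation: the cosine distance, the average-linkage merge rule, the cut threshold, and all function names and toy vectors below are assumptions chosen only for illustration. In practice the distances could equally come from another source, such as Foldseek.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(embeddings):
    """Symmetric all-against-all distance matrix as a list of lists."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = cosine_distance(embeddings[i], embeddings[j])
    return dist


def average_linkage_tree(dist):
    """Build a tree by repeatedly merging the two closest clusters.

    Leaves are numbered 0..n-1; each merge creates a new node n, n+1, ...
    Returns a list of (child_a, child_b, height, size) tuples.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(members) > 1:
        best = None
        for a, b in combinations(sorted(members), 2):
            # Average pairwise distance between the two clusters.
            total = sum(dist[i][j] for i in members[a] for j in members[b])
            height = total / (len(members[a]) * len(members[b]))
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        members[next_id] = members.pop(a) + members.pop(b)
        merges.append((a, b, height, len(members[next_id])))
        next_id += 1
    return merges


def cut_tree(n_leaves, merges, threshold):
    """Flat clusters obtained by applying merges up to a height threshold."""
    clusters = {i: [i] for i in range(n_leaves)}
    for k, (a, b, height, _size) in enumerate(merges):
        if height > threshold:
            break  # average-linkage heights never decrease, so stop here
        clusters[n_leaves + k] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical 2-D "embeddings" for four domains: two near each axis.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
dist = distance_matrix(toy_embeddings)
merges = average_linkage_tree(dist)
clusters = cut_tree(len(toy_embeddings), merges, threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

Once the distance matrix is available, the tree construction never touches the sequences or structures again, which is why the same procedure can serve different data types. The naive merge loop here is cubic or worse in the number of items, so it is only suitable for toy inputs.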

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c0c3/11325189/a36b8eebdfb0/PRO-33-e5140-g004.jpg
