Institute of Structural and Molecular Biology, University College London, London, UK.
Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain.
Protein Sci. 2024 Sep;33(9):e5140. doi: 10.1002/pro.5140.
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which function is conserved across members. The increasing volume and complexity of protein data made available by large-scale repositories such as MGnify and the AlphaFold Database require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms that build upon and address limitations of GeMMA/FunFHMMER, our original methods for classifying proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, supporting future studies of protein function and evolution.
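To make the idea of forming a relationship tree from a distance matrix concrete, the following is a minimal sketch and not the CATH-eMMA implementation. It assumes each domain is represented by a small numeric embedding vector, uses Euclidean distance, and builds the tree by average-linkage (UPGMA) agglomerative merging; the abstract does not state which distance measure or linkage rule CATH-eMMA uses, so both are illustrative choices. The function names `distance_matrix` and `average_linkage_tree`, the labels, and the toy vectors are all hypothetical.

```python
import math
from itertools import combinations


def distance_matrix(embeddings):
    """Return a symmetric matrix of pairwise Euclidean distances."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = math.dist(embeddings[i], embeddings[j])
        dist[i][j] = dist[j][i] = d
    return dist


def average_linkage_tree(dist, labels):
    """Build a binary tree by repeatedly merging the two closest clusters.

    Cluster-to-cluster distance is the mean of all member-to-member
    distances (average linkage, an assumption made for illustration).
    The tree is returned as nested tuples whose leaves are the labels.
    """
    # Each active cluster maps an id to (subtree, list of member indices).
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)

    def mean_dist(a, b):
        members_a, members_b = clusters[a][1], clusters[b][1]
        total = sum(dist[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Pick the pair of active clusters with the smallest mean distance.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: mean_dist(*pair))
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


# Toy data: two tight pairs of hypothetical domains, far from each other.
labels = ["domA", "domB", "domC", "domD"]
embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0]]

dist = distance_matrix(embeddings)
tree = average_linkage_tree(dist, labels)
print(tree)
```

In this toy case the two close pairs are merged first and then joined at the root. Because the tree is built from the distance matrix alone, the same function would accept precomputed structural distances (for example, values derived from Foldseek output) in place of embedding distances, which is the kind of flexibility across data types that the abstract describes.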