Clustering protein functional families at large scale with hierarchical approaches.

Affiliations

Institute of Structural and Molecular Biology, University College London, London, UK.

Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain.

Publication information

Protein Sci. 2024 Sep;33(9):e5140. doi: 10.1002/pro.5140.

DOI: 10.1002/pro.5140
PMID: 39145441
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11325189/
Abstract

Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
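The abstract describes forming relationship trees from distance matrices derived from embeddings or Foldseek distances. The following is a minimal sketch of that general idea only, not the CATH-eMMA implementation: it assumes cosine distance between toy embedding vectors and plain average-linkage agglomerative clustering, and every function name, the toy vectors, and the 0.5 cut threshold are hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric all-against-all distance matrix with a zero diagonal."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])
    return dist


def average_linkage_tree(dist):
    """Build a tree by repeatedly merging the two closest clusters.

    Leaves are numbered 0..n-1 and internal nodes n, n+1, ...
    Returns a list of merges as (left_id, right_id, height, new_id).
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(members) > 1:
        best = None
        for a, b in combinations(sorted(members), 2):
            # Average linkage: mean pairwise distance between the two clusters.
            total = sum(dist[i][j] for i in members[a] for j in members[b])
            avg = total / (len(members[a]) * len(members[b]))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        height, a, b = best
        members[next_id] = members.pop(a) + members.pop(b)
        merges.append((a, b, height, next_id))
        next_id += 1
    return merges


def cut_tree(merges, n_leaves, threshold):
    """Return the clusters obtained by applying merges up to a height threshold."""
    clusters = {i: [i] for i in range(n_leaves)}
    for a, b, height, new_id in merges:
        if height > threshold:
            break  # stop at the first merge above the cut
        clusters[new_id] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())


# Toy "embeddings" (assumed data): two pairs of nearly parallel vectors.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merges = average_linkage_tree(distance_matrix(embeddings))
clusters = cut_tree(merges, len(embeddings), threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

Because the tree is built from a precomputed distance matrix, the same two functions would work unchanged on any other pairwise distance, such as one derived from structure comparison scores. This naive loop recomputes cluster distances at every step, so it is only suitable for small inputs.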


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c0c3/11325189/7e23dac9707f/PRO-33-e5140-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c0c3/11325189/a36b8eebdfb0/PRO-33-e5140-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c0c3/11325189/747a47085cfb/PRO-33-e5140-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c0c3/11325189/4f1687ff48ad/PRO-33-e5140-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c0c3/11325189/7a1b7df6b8e1/PRO-33-e5140-g001.jpg
