
Machine learning-based prediction of proteins' architecture using sequences of amino acids and structural alphabets.

Authors

Abbass Jad, Parisi Charles

Affiliations

School of Computer Science and Mathematics, Kingston University, London, UK.

Telecom Physique Strasbourg, Strasbourg University, Strasbourg, France.

Publication information

J Biomol Struct Dyn. 2024 Mar 20:1-16. doi: 10.1080/07391102.2024.2328736.

DOI: 10.1080/07391102.2024.2328736
PMID: 38505995

Abstract

In addition to the growth of protein structures generated through wet laboratory experiments and deposited in the PDB repository, AlphaFold predictions have significantly contributed to the creation of a much larger database of protein structures. Annotating such a vast number of structures has become an increasingly challenging task. CATH is widely recognized as one of the most common platforms for addressing this challenge, as it classifies proteins based on their structural and evolutionary relationships, offering the scientific community an invaluable resource for uncovering various properties, including functional annotations. Because CATH annotation involves human intervention to some extent, keeping up with the classification of the rapidly expanding repositories of protein structures has become exceedingly difficult. Therefore, there is a pressing need for a fully automated approach. On the other hand, the abundance of protein sequences stemming from next-generation sequencing technologies, lacking structural annotations, presents an additional challenge to the scientific community. Consequently, 'pre-annotating' protein sequences with structural features, ensuring a high level of precision, could prove highly advantageous. In this paper, after a thorough investigation, we introduce a novel machine-learning model capable of classifying any protein domain, whether it has a known structure or not, into one of the 40 main CATH Architectures. We achieve an F1 Score of 0.92 using only the amino acid sequence and a score of 0.94 using both the sequence of amino acids and the sequence of structural alphabets.
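
The abstract reports F1 scores for a 40-class problem but does not say how the score is averaged across classes. As a rough illustration only, the sketch below computes per-class F1 and an unweighted (macro) average in plain Python. The macro averaging is an assumption, the function names `per_class_f1` and `macro_f1` are hypothetical, and the labels and predictions are invented toy data that merely use CATH-style architecture codes; none of this is taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this class
        else:
            fp[pred] += 1           # predicted class gains a false positive
            fn[truth] += 1          # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not the paper's)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: three architecture codes, six domains (illustrative only).
y_true = ["3.40", "3.40", "1.10", "2.60", "2.60", "1.10"]
y_pred = ["3.40", "1.10", "1.10", "2.60", "2.60", "1.10"]

scores = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(scores)
print(round(overall, 4))
```

With many classes of unequal size, a macro average weights rare architectures as heavily as common ones, whereas a support-weighted average would not; which variant the reported 0.92 and 0.94 correspond to cannot be determined from the abstract alone.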

Similar articles

1
Machine learning-based prediction of proteins' architecture using sequences of amino acids and structural alphabets.
J Biomol Struct Dyn. 2024 Mar 20:1-16. doi: 10.1080/07391102.2024.2328736.
2
Prediction of structural alphabet protein blocks using data mining.
Biochimie. 2022 Jun;197:74-85. doi: 10.1016/j.biochi.2022.01.019. Epub 2022 Feb 7.
3
Automatic classification of protein structures using low-dimensional structure space mappings.
BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S1. doi: 10.1186/1471-2105-15-S2-S1. Epub 2014 Jan 24.
4
SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.
BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.
5
CATH: increased structural coverage of functional space.
Nucleic Acids Res. 2021 Jan 8;49(D1):D266-D273. doi: 10.1093/nar/gkaa1079.
6
CATH: expanding the horizons of structure-based functional annotations for genome sequences.
Nucleic Acids Res. 2019 Jan 8;47(D1):D280-D284. doi: 10.1093/nar/gky1097.
7
CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds.
J Mol Biol. 2024 Sep 1;436(17):168551. doi: 10.1016/j.jmb.2024.168551. Epub 2024 Mar 27.
8
Automated alphabet reduction for protein datasets.
BMC Bioinformatics. 2009 Jan 6;10:6. doi: 10.1186/1471-2105-10-6.
9
CATH v4.4: major expansion of CATH by experimental and predicted structural data.
Nucleic Acids Res. 2025 Jan 6;53(D1):D348-D355. doi: 10.1093/nar/gkae1087.
10
Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data.
BMC Bioinformatics. 2024 Jan 4;25(1):11. doi: 10.1186/s12859-023-05586-5.