Abbass Jad, Parisi Charles
School of Computer Science and Mathematics, Kingston University, London, UK.
Telecom Physique Strasbourg, Strasbourg University, Strasbourg, France.
J Biomol Struct Dyn. 2024 Mar 20:1-16. doi: 10.1080/07391102.2024.2328736.
In addition to the growth of protein structures generated through wet laboratory experiments and deposited in the PDB repository, AlphaFold predictions have significantly contributed to the creation of a much larger database of protein structures. Annotating such a vast number of structures has become an increasingly challenging task. CATH is widely recognized as one of the most common platforms for addressing this challenge, as it classifies proteins based on their structural and evolutionary relationships, offering the scientific community an invaluable resource for uncovering various properties, including functional annotations. Because CATH annotation involves some degree of human intervention, keeping up with the classification of the rapidly expanding repositories of protein structures has become exceedingly difficult. Therefore, there is a pressing need for a fully automated approach. On the other hand, the abundance of protein sequences stemming from next-generation sequencing technologies, which lack structural annotations, presents an additional challenge to the scientific community. Consequently, 'pre-annotating' protein sequences with structural features, with a high level of precision, could prove highly advantageous. In this paper, after a thorough investigation, we introduce a novel machine-learning model capable of classifying any protein domain, whether it has a known structure or not, into one of the 40 main CATH Architectures. We achieve an F1 score of 0.92 using only the amino acid sequence and a score of 0.94 using both the sequence of amino acids and the sequence of structural alphabets.
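For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score can be computed for a multi-class problem such as assigning domains to CATH Architectures. The abstract does not state which averaging scheme was used, so the macro average shown here is an assumption, and the function names, labels and predictions are hypothetical toy values rather than data from the study.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct assignment
        else:
            fp[p] += 1          # predicted class gains a false positive
            fn[t] += 1          # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not from the paper)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example: architecture codes used purely as string labels,
# with made-up predictions for six hypothetical domains.
y_true = ["3.40", "3.40", "1.10", "2.60", "1.10", "2.60"]
y_pred = ["3.40", "1.10", "1.10", "2.60", "1.10", "3.40"]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

With 40 classes of very different sizes, a macro average weights every Architecture equally, whereas a weighted or micro average would be dominated by the most populated ones; the reported 0.92 and 0.94 should be read with that distinction in mind.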