Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
Advanced Research Computing Centre, University College London, London, United Kingdom.
J Mol Biol. 2024 Sep 1;436(17):168551. doi: 10.1016/j.jmb.2024.168551. Epub 2024 Mar 27.
CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to PDB structures not yet classified in CATH and to AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.