Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana, USA.
Center for Data and Computing in Natural Sciences (CDCS), Institute for Computational Systems Biology, Universität Hamburg, Hamburg, Germany.
Proteins. 2022 Sep;90(9):1721-1731. doi: 10.1002/prot.26349. Epub 2022 May 2.
Protein structural classification (PSC) is the supervised problem of assigning proteins to predefined structural classes (e.g., CATH or SCOPe) based on the proteins' sequence or 3D structural features. We recently proposed PSC approaches that model protein 3D structures as protein structure networks (PSNs) and analyze PSN-based protein features; these approaches performed better than or comparably to state-of-the-art sequence-based and other 3D structure-based PSC approaches. However, existing PSN-based PSC approaches model the whole 3D structure of a protein as a static (i.e., single-layer) PSN. Because protein folding is a dynamic process in which some parts (i.e., sub-structures) of a protein fold before others, modeling the 3D structure of a protein as a PSN that captures these sub-structures might further improve PSC performance. Here, we propose to model 3D structures of proteins as multi-layer sequential PSNs that approximate 3D sub-structures of proteins, with the hypothesis that this will improve upon the current state-of-the-art PSC approaches that are based on single-layer PSNs (and thus upon the existing state-of-the-art sequence and other 3D structural approaches). Indeed, we confirm this on 72 datasets spanning approximately 44,000 CATH and SCOPe protein domains.
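To make the contrast between a single-layer PSN and a multi-layer sequential PSN concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each residue is represented by one 3D coordinate, that two residues are linked when they lie within a distance cutoff, and that layer k covers the first k × step residues of the chain, so the last layer is the whole-structure (single-layer) PSN. The function names, the cutoff of 6.0 Å, the step size, and the toy coordinates are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from itertools import combinations
from math import dist


def contact_edges(coords, cutoff=6.0):
    """Return residue pairs (i, j), i < j, whose coordinates lie within `cutoff`.

    Assumption: one representative point per residue and a plain distance
    threshold; the cutoff value is illustrative, not taken from the abstract.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if dist(coords[i], coords[j]) <= cutoff
    }


def sequential_layers(coords, step=2, cutoff=6.0):
    """Build one contact network per growing prefix of the chain.

    Assumption: a "sub-structure" is approximated by the first k * step
    residues; the final layer always covers the full chain.
    """
    n = len(coords)
    sizes = list(range(step, n, step)) + [n]
    return [contact_edges(coords[:size], cutoff) for size in sizes]


# Hypothetical toy chain of six residues laid out as a hairpin (units: angstroms).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]

layers = sequential_layers(toy_coords, step=2, cutoff=6.0)
for k, edges in enumerate(layers, start=1):
    print(f"layer {k}: {len(edges)} edges")
```

Because each layer is built on a longer prefix of the same chain, every layer's edge set contains the previous one. Features computed per layer can then reflect when in the sequence a contact first appears, which a single whole-structure network cannot express.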