School of Natural Sciences, University of Tasmania, Australia.
J Struct Biol. 2022 Sep;214(3):107870. doi: 10.1016/j.jsb.2022.107870. Epub 2022 May 29.
Discovery of new folds in the Protein Data Bank (PDB) has all but ceased. This could be viewed as evidence that all existing protein folds have been documented. Sampling bias has, however, been presented as an alternative explanation. Furthermore, although we may know of all protein folds that do exist, we may not have documented all protein folds that could exist. Although assessing completeness for entire protein structures is extremely difficult, the structures can be simplified in a number of ways. We present one such simplification: treating protein structures as a series of α helices and β sheets and analysing the geometric relationships between these successive secondary structure elements (SSEs) through torsion angles, lengths and distances. We aimed to find out whether all substructures that could be formed by triplets of these successive SSEs were represented in the PDB. When SSEs were defined with the assignment program Promotif, a gap was identified in the represented torsion angles of helix-strand-strand substructures. This gap was not present when SSEs were defined with DSSP, an alternative assignment program with a smaller minimum SSE length. We also represented proteins as one-dimensional sequences of SSE types and searched for underrepresented motifs. Completely absent motifs occurred more often than expected at random. If a gap in SSE substructure space exists that could be filled, or if a physically possible SSE motif is absent, associated gaps in protein structure space are implied, meaning that the PDB as we know it may not be complete.
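The two simplifications described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: each SSE is reduced to a single axis vector, the torsion angle of a triplet is taken as the standard signed dihedral of the first and third axes about the second, and SSE types are encoded with the hypothetical one-letter alphabet `H` (helix) and `E` (strand). The function names `torsion_deg` and `absent_motifs` are likewise illustrative.

```python
import math
from itertools import product


def cross(u, v):
    """Cross product of two 3D vectors given as tuples."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Dot product of two 3D vectors given as tuples."""
    return sum(a * b for a, b in zip(u, v))


def torsion_deg(b1, b2, b3):
    """Signed dihedral angle in degrees of b1 and b3 about b2.

    Assumption: b1, b2 and b3 are the axis vectors of three successive
    SSEs. Uses the standard atan2 form of the dihedral angle formula.
    """
    n1 = cross(b1, b2)  # normal of the plane spanned by b1 and b2
    n2 = cross(b2, b3)  # normal of the plane spanned by b2 and b3
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def absent_motifs(sse_strings, k, alphabet="HE"):
    """Return, sorted, every length-k motif over `alphabet` that occurs
    in none of the given SSE-type strings."""
    seen = set()
    for s in sse_strings:
        for i in range(len(s) - k + 1):
            seen.add(s[i:i + k])
    candidates = ("".join(m) for m in product(alphabet, repeat=k))
    return sorted(m for m in candidates if m not in seen)


# Toy data, invented for illustration only.
print(torsion_deg((1, 0, 0), (0, 1, 0), (0, 0, 1)))
print(absent_motifs(["HEEH", "EHHE"], 3))
```

With real data, the axis vectors would be derived from SSE assignments (for example from Promotif or DSSP output), and the count of absent motifs would be compared against counts from randomised SSE sequences to judge whether the absences exceed what is expected at random.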