Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom.
European Molecular Biology Laboratory, Hamburg Unit, Notkestrasse 85, 22607 Hamburg, Germany.
IUCrJ. 2024 Nov 1;11(Pt 6):938-950. doi: 10.1107/S2052252524009114.
The accuracy of the information in the Protein Data Bank (PDB) is of great importance for the myriad downstream applications that make use of protein structural information. Despite best efforts, the occasional introduction of errors is inevitable, especially where the experimental data are of limited resolution. A novel protein structure validation approach has previously been established, based on spotting inconsistencies between the residue contacts and distances observed in a structural model and those computationally predicted by methods such as AlphaFold2. It is particularly well suited to the detection of register errors. Importantly, this new approach is orthogonal to traditional methods based on stereochemistry or map-model agreement, and is resolution independent. Here, thousands of likely register errors are identified by scanning 3-5 Å resolution structures in the PDB. Unlike most validation methods, this approach also suggests corrections to the register of affected regions; even with limited implementation, these corrections are shown to improve refinement statistics in the vast majority of cases. A few limitations and confounding factors, such as fold-switching proteins, are characterized, but this approach is expected to have broad application in spotting potential issues in current accessions and, through its implementation and distribution in CCP4, in helping to ensure the accuracy of future depositions.
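The core idea, comparing the contacts observed in a model with predicted contacts and asking whether renumbering a region would improve agreement, can be illustrated with a toy sketch. This is not the authors' implementation or the CCP4 tool: the function names `contact_agreement` and `best_register_shift`, the fixed residue window, the simple match count used as a score, and the example contacts are all assumptions made for illustration.

```python
# Toy sketch (hypothetical, not the published method): find the register shift
# of a residue window that best reconciles observed and predicted contacts.

def contact_agreement(observed, predicted, window, shift):
    """Count predicted contacts reproduced when every residue number inside
    `window` is renumbered by `shift` in the observed contact list."""
    def renumber(i):
        return i + shift if i in window else i

    shifted = {frozenset((renumber(i), renumber(j))) for i, j in observed}
    return len(shifted & predicted)


def best_register_shift(observed, predicted, window, max_shift=3):
    """Try shifts in [-max_shift, max_shift]; return the best shift and all scores.
    Ties are broken in favour of the smaller absolute shift."""
    predicted = {frozenset(pair) for pair in predicted}
    window = set(window)
    scores = {
        s: contact_agreement(observed, predicted, window, s)
        for s in range(-max_shift, max_shift + 1)
    }
    best = max(scores, key=lambda s: (scores[s], -abs(s)))
    return best, scores


# Made-up example: predicted antiparallel pairing of residues 10-14 with 30-34.
predicted_contacts = [(10, 34), (11, 33), (12, 32), (13, 31), (14, 30)]
# In the model, the second strand was built two residues out of register,
# so the same spatial positions carry the labels 28-32.
observed_contacts = [(10, 32), (11, 31), (12, 30), (13, 29), (14, 28)]

best, scores = best_register_shift(
    observed_contacts, predicted_contacts, window=range(28, 33)
)
print(best, scores[best], scores[0])
```

In this contrived case, the unshifted model reproduces none of the predicted contacts, whereas renumbering the window by +2 reproduces all five, which is the kind of signal that would flag a possible register error and suggest a correction. A real analysis would work from predicted distance or contact probabilities and would need to handle the confounding factors noted in the abstract.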