Department of Biochemistry, Duke University, Durham, North Carolina, USA.
Protein Sci. 2022 Jan;31(1):290-300. doi: 10.1002/pro.4239. Epub 2021 Nov 29.
We have curated a high-quality, "best-parts" reference dataset of about 3 million protein residues in about 15,000 PDB-format coordinate files, each containing only residues with good electron density support for a physically acceptable model conformation. The resulting prefiltered data typically contain the entire core of each chain, in quite long continuous fragments. Each reference file is a single protein chain, and the full set of files was selected for low redundancy, high resolution, good MolProbity score, and other chain-level criteria. Each residue was then critically tested for local map quality adequate to firmly support its conformation, which must also be free of serious clashes and covalent-geometry outliers. The resulting Top2018 prefiltered datasets have been released on the Zenodo online service and are freely available for all uses under a Creative Commons license. Currently, one dataset is residue-filtered on main-chain plus Cβ atoms, and a second dataset is full-residue filtered; each is available at four different sequence-identity levels. Here, we present both statistics and examples that show the beneficial consequences of residue-level filtering. That process is necessary because even the best structures contain a few highly disordered local regions with poor density and low-confidence conformations that should not be included in reference data. Therefore, the open distribution of these very large, prefiltered reference datasets constitutes a notable advance for structural bioinformatics and the fields that depend upon it.
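Because residue-level filtering removes individual residues from a chain, a consumer of such files will usually want to recover the continuous fragments that remain. The following is a minimal sketch of one way to do that, not the authors' tooling. It assumes standard fixed-column PDB ATOM records and treats any jump in residue number within a chain as a fragment break, which may also split at numbering gaps that were already present in the deposited structure. The names `residue_fragments` and `atom_line` are hypothetical helpers introduced only for this illustration.

```python
def residue_fragments(pdb_lines):
    """Group residues from ATOM records into runs of consecutive residue numbers.

    Assumption: a jump in residue number within a chain marks a fragment break.
    Returns a list of fragments; each fragment is a list of
    (chain_id, residue_number, insertion_code) tuples.
    """
    fragments = []
    last = None
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]            # column 22: chain identifier
        resseq = int(line[22:26])   # columns 23-26: residue sequence number
        icode = line[26]            # column 27: insertion code
        key = (chain, resseq, icode)
        if key == last:
            continue                # another atom of the same residue
        same_run = (
            last is not None
            and chain == last[0]
            and resseq - last[1] in (0, 1)  # 0 covers insertion-code residues
        )
        if same_run:
            fragments[-1].append(key)
        else:
            fragments.append([key])
        last = key
    return fragments


def atom_line(serial, name, resname, chain, resseq):
    """Hypothetical helper: build a fixed-column ATOM record with dummy coordinates."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"
    )


# Toy chain with residues 1-3 and 7-8 present (4-6 imagined as filtered out).
demo_lines = []
serial = 1
for resseq in (1, 2, 3, 7, 8):
    for name in (" N  ", " CA "):
        demo_lines.append(atom_line(serial, name, "ALA", "A", resseq))
        serial += 1

demo_fragments = residue_fragments(demo_lines)
print([(f[0][1], f[-1][1]) for f in demo_fragments])  # prints [(1, 3), (7, 8)]
```

Fragment lengths computed this way give a quick check on how much of each chain survives filtering in long continuous stretches.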