Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
AstraZeneca, Biomedical Campus, 1 Francis Crick Ave, Trumpington, Cambridge, CB2 0AA, UK.
Sci Data. 2023 Apr 12;10(1):204. doi: 10.1038/s41597-023-02101-6.
More than 61,000 proteins have an up-to-date correspondence between their amino acid sequences (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS also incorporates residue-level annotations from many other biological resources. SIFTS data is available in XML, CSV and TSV formats and via the PDBe REST API, but has always been maintained separately from the structure data (PDBx/mmCIF files) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data, and added the UniProtKB, Pfam, SCOP2 and CATH residue-level annotations directly to the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent residue numbering across different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development provides up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.
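To illustrate how residue-level UniProtKB annotations could give two PDB entries a common numbering, the following is a minimal Python sketch rather than the authors' implementation. The abstract does not name the new dictionary items, so the item names used here (pdbx_sifts_xref_db_acc, pdbx_sifts_xref_db_num) are assumptions, the accession P00000 and all residue rows are made-up toy data, and the loop reader is a deliberately simplified stand-in for a real mmCIF parser (it splits on whitespace and ignores quoting).

```python
from typing import Dict, List, Tuple

# Toy mmCIF-like loops for two hypothetical entries of the same protein.
# Item names are assumed for illustration; accession P00000 is made up.
ENTRY_A = """\
loop_
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_comp_id
_atom_site.pdbx_sifts_xref_db_acc
_atom_site.pdbx_sifts_xref_db_num
A 1 MET P00000 25
A 2 LYS P00000 26
A 3 VAL P00000 27
"""

ENTRY_B = """\
loop_
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_comp_id
_atom_site.pdbx_sifts_xref_db_acc
_atom_site.pdbx_sifts_xref_db_num
B 101 LYS P00000 26
B 102 VAL P00000 27
B 103 ASP P00000 28
B 104 GLY ? ?
"""


def read_loop(text: str) -> List[Dict[str, str]]:
    """Read one whitespace-delimited loop into a list of row dictionaries."""
    names: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            # Keep only the item name after the category prefix.
            names.append(line.split(".", 1)[1])
        else:
            rows.append(dict(zip(names, line.split())))
    return rows


def by_uniprot_number(
    rows: List[Dict[str, str]], accession: str
) -> Dict[int, Tuple[str, str]]:
    """Map UniProt residue number -> (author residue number, residue name)."""
    mapping: Dict[int, Tuple[str, str]] = {}
    for row in rows:
        number = row["pdbx_sifts_xref_db_num"]
        # Skip residues with no mapping ("?") or mapped to another accession.
        if row["pdbx_sifts_xref_db_acc"] != accession or not number.isdigit():
            continue
        mapping[int(number)] = (row["auth_seq_id"], row["label_comp_id"])
    return mapping


def shared_positions(a: Dict[int, Tuple[str, str]],
                     b: Dict[int, Tuple[str, str]]) -> List[int]:
    """UniProt positions that are present in both entries, in ascending order."""
    return sorted(set(a) & set(b))


map_a = by_uniprot_number(read_loop(ENTRY_A), "P00000")
map_b = by_uniprot_number(read_loop(ENTRY_B), "P00000")

for pos in shared_positions(map_a, map_b):
    (num_a, name_a), (num_b, name_b) = map_a[pos], map_b[pos]
    print(f"UniProt {pos}: entry A residue {num_a} ({name_a}), "
          f"entry B residue {num_b} ({name_b})")
```

In this toy case the two entries use unrelated author numbering (1 to 3 versus 101 to 104), yet the shared UniProt positions line up directly. For real files, a dedicated mmCIF parser would be the safer choice than the simplified reader above.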