Nh3D: a reference dataset of non-homologous protein structures.

Authors

Thiruv B, Quon G, Saldanha S A, Steipe B

Affiliation

Department of Biochemistry, University of Toronto, 1 Kings College Circle, Toronto, Ontario M5S 1A8, Canada.

Publication information

BMC Struct Biol. 2005 Jul 12;5:12. doi: 10.1186/1472-6807-5-12.

DOI:10.1186/1472-6807-5-12

PMID:16011803

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1182382/

Abstract

BACKGROUND

The statistical analysis of protein structures requires datasets in which structural features can be considered independently distributed, i.e. not related through common ancestry, and that fulfil minimal requirements regarding the experimental quality of the structures they contain. However, non-redundant datasets based on sequence similarity invariably contain distantly related homologues. Here we provide a reference dataset of non-homologous protein domains, assuming that structural dissimilarity at the topology level is incompatible with recognizable common ancestry. The dataset is based on domains at the Topology level of the CATH database, which hierarchically classifies all protein structures. It contains the best refined representative of each Topology; its compilation also validates structural dissimilarity and removes internally duplicated fragments. The compilation of Nh3D is fully scripted.
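The selection step described above, one best-refined representative per CATH Topology, can be illustrated with a minimal Python sketch. It is not the authors' script: the `Domain` record, the example identifiers and the ranking by resolution and then R-factor are assumptions made for illustration only. The one fact relied on is that a CATH code has the form Class.Architecture.Topology.Homologous-superfamily, so the first three fields identify the Topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Hypothetical domain record; field names are illustrative assumptions."""
    domain_id: str     # made-up identifier, not a real CATH domain name
    cath_code: str     # Class.Architecture.Topology.Homologous-superfamily
    resolution: float  # angstroms, lower is better
    r_factor: float    # lower is better


def topology_of(cath_code: str) -> str:
    """Return the first three fields (C.A.T) of a four-field CATH code."""
    return ".".join(cath_code.split(".")[:3])


def best_per_topology(domains):
    """Keep one domain per Topology, ranked by (resolution, R-factor).

    The ranking rule is an assumption; the source only says
    'best refined representative'.
    """
    best = {}
    for d in domains:
        key = topology_of(d.cath_code)
        current = best.get(key)
        if current is None or (d.resolution, d.r_factor) < (
            current.resolution,
            current.r_factor,
        ):
            best[key] = d
    return best


# Made-up example data: two Topologies, two candidate domains each.
domains = [
    Domain("dom_a", "3.40.50.720", 2.1, 0.21),
    Domain("dom_b", "3.40.50.300", 1.6, 0.19),
    Domain("dom_c", "1.10.10.10", 1.9, 0.20),
    Domain("dom_d", "1.10.10.60", 1.9, 0.18),
]

reps = best_per_topology(domains)
for topo, d in sorted(reps.items()):
    print(topo, d.domain_id)
```

In this toy input, `dom_b` wins Topology 3.40.50 on resolution, and `dom_d` wins 1.10.10 on R-factor after a tie in resolution. A real pipeline would additionally need the dissimilarity validation and the removal of internally duplicated fragments mentioned in the abstract, which this sketch does not attempt.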

RESULTS

The current Nh3D list contains 570 domains with a total of 90780 residues. It covers more than 70% of folds at the Topology level of the CATH database and represents more than 90% of the structures in the PDB that have been classified by CATH. We observe that even though all protein pairs are structurally dissimilar, some pairwise sequence identities after global alignment are greater than 30%.
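To make the phrase "pairwise sequence identity after global alignment" concrete, the following Python sketch computes it with a plain Needleman-Wunsch alignment. The scoring scheme (match +1, mismatch -1, linear gap -1) and the definition of identity as identical aligned pairs divided by alignment length are assumptions chosen for brevity; the source does not state the alignment program, substitution matrix or gap penalties used, so the 30% figure would not be reproduced by this exact code.

```python
def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Fraction of identical aligned pairs in one optimal global alignment.

    Uses Needleman-Wunsch with a linear gap penalty. The denominator is the
    alignment length including gap columns (an assumed convention).
    """
    n, m = len(a), len(b)

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back one optimal path, counting columns and identical pairs.
    i, j = n, m
    identical = 0
    length = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                identical += 1
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1

    return identical / length if length else 0.0


# Toy sequences, not taken from Nh3D.
print(global_identity("ACDEFG", "ACDFG"))
```

For the toy pair, the optimal alignment inserts one gap opposite `E`, giving five identical pairs over six columns. Because the denominator includes gap columns, other identity conventions (for example, dividing by the shorter sequence length) would give different percentages for the same alignment.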

CONCLUSION

Nh3D is freely available as a reference dataset for the statistical analysis of sequence and structure features of proteins in the PDB. Regularly updated versions of Nh3D and the corresponding PDB-formatted coordinate sets are accessible from our Web site http://www.schematikon.org.
