Thiruv B, Quon G, Saldanha S A, Steipe B
Department of Biochemistry, University of Toronto, 1 Kings College Circle, Toronto, Ontario M5S 1A8, Canada.
BMC Struct Biol. 2005 Jul 12;5:12. doi: 10.1186/1472-6807-5-12.
The statistical analysis of protein structures requires datasets in which structural features can be considered independently distributed, i.e. not related through common ancestry, and which fulfil minimal requirements regarding the experimental quality of the structures they contain. However, non-redundant datasets based on sequence similarity invariably contain distantly related homologues. Here we provide Nh3D, a reference dataset of non-homologous protein domains, assuming that structural dissimilarity at the topology level is incompatible with recognizable common ancestry. The dataset is based on domains at the Topology level of the CATH database, which hierarchically classifies all protein structures. It contains the best-refined representatives of each Topology level; structural dissimilarity between them is validated, and internally duplicated fragments are removed. The compilation of Nh3D is fully scripted.
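The representative-selection step can be illustrated with a minimal sketch. It is not the authors' script: the record fields, the `max_resolution` cutoff, the ranking by resolution and then R-factor, and the choice of a single representative per Topology are all assumptions made for illustration. The only property taken from CATH is that the Topology level corresponds to the first three fields of a dotted classification code.

```python
from typing import Dict, Iterable, NamedTuple


class Domain(NamedTuple):
    domain_id: str     # hypothetical identifier, not a real CATH domain name
    cath_code: str     # dotted CATH code: Class.Architecture.Topology.Homology
    resolution: float  # in angstroms; lower is better
    r_factor: float    # crystallographic R-factor; lower is better


def topology_of(cath_code: str) -> str:
    """Return the Class.Architecture.Topology prefix of a CATH code."""
    return ".".join(cath_code.split(".")[:3])


def pick_representatives(domains: Iterable[Domain],
                         max_resolution: float = 2.5) -> Dict[str, Domain]:
    """Keep one domain per Topology, preferring better-refined structures.

    The cutoff and the ranking key are assumptions for this sketch,
    not values taken from the Nh3D pipeline.
    """
    best: Dict[str, Domain] = {}
    for d in domains:
        if d.resolution > max_resolution:
            continue  # fails the (assumed) minimal quality requirement
        key = topology_of(d.cath_code)
        current = best.get(key)
        if current is None or (d.resolution, d.r_factor) < (
                current.resolution, current.r_factor):
            best[key] = d
    return best


# Invented example records, for illustration only.
example = [
    Domain("domA", "3.40.50.300", 1.8, 0.19),
    Domain("domB", "3.40.50.720", 1.5, 0.20),
    Domain("domC", "1.10.10.10", 3.0, 0.25),
]
representatives = pick_representatives(example)
```

A full pipeline would additionally verify that the chosen representatives are structurally dissimilar to one another and remove internally duplicated fragments; neither step is attempted here.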
The current Nh3D list contains 570 domains with a total of 90780 residues. It covers more than 70% of folds at the Topology level of the CATH database and represents more than 90% of the structures in the PDB that have been classified by CATH. We observe that even though all protein pairs are structurally dissimilar, some pairwise sequence identities after global alignment are greater than 30%.
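To make the notion of pairwise sequence identity after global alignment concrete, the following sketch implements a plain Needleman-Wunsch alignment and reports identity as matched positions divided by alignment length. The scoring scheme (match +1, mismatch -1, linear gap -1) and the choice of denominator are assumptions for illustration. The abstract does not state which alignment parameters or identity definition were used, and a realistic analysis would use a substitution matrix and affine gap penalties.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -1) -> tuple:
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned residues as a percentage of alignment length."""
    x, y = global_align(a, b)
    if not x:
        return 0.0
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)
```

Because identity depends on both the scoring scheme and the denominator (alignment length, shorter sequence, or aligned positions only), values near a threshold such as 30% should be compared only between analyses that use the same definition.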
Nh3D is freely available as a reference dataset for the statistical analysis of sequence and structure features of proteins in the PDB. Regularly updated versions of Nh3D and the corresponding PDB-formatted coordinate sets are accessible from our Web site http://www.schematikon.org.