Selection of a representative set of structures from Brookhaven Protein Data Bank.

Author information

Boberg J, Salakoski T, Vihinen M

Affiliation

Department of Computer Science, University of Turku, Finland.

Publication information

Proteins. 1992 Oct;14(2):265-76. doi: 10.1002/prot.340140212.

PMID:1409573

Abstract

Reliable structural and statistical analyses of three-dimensional protein structures should be based on unbiased data. The Protein Data Bank is highly redundant, containing several entries for identical or very similar sequences. A technique was developed for clustering the known structures based on their sequences and contents of alpha- and beta-structures. First, sequences were aligned pairwise. A representative sample of sequences was then obtained by grouping similar sequences together and selecting a typical representative from each group. The similarity significance threshold needed in the clustering method was found by analyzing similarities of random sequences. Because three-dimensional structures for proteins of the same structural class are generally more conserved than their sequences, the proteins were also clustered according to their contents of secondary structural elements. The results of these clusterings indicate conservation of alpha- and beta-structures even when sequence similarity is relatively low. An unbiased sample of 103 high-resolution structures, representing a wide variety of proteins, was chosen based on the suggestions made by the clustering algorithm. The proteins were divided into structural classes according to their contents and ratios of secondary structural elements. Previous classifications have suffered from a subjective view of secondary structures, whereas here the classification was based on backbone geometry. The concise view led to reclassification of some structures. The representative set of structures facilitates unbiased analyses of relationships between protein sequence, function, and structure as well as of structural characteristics.
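The sequence-based part of the procedure (pairwise alignment, a significance threshold estimated from random sequences, grouping, and choice of a typical representative) can be illustrated with a minimal Python sketch. The abstract does not specify the alignment scoring, the clustering rule, or how the representative was chosen, so everything below is an assumption made for illustration: a Needleman-Wunsch score with unit match, mismatch, and gap costs, a threshold of mean plus `n_sigma` standard deviations over shuffled sequences, single-linkage grouping, and the member with the highest summed similarity to the rest of its group as the representative. All function names and parameters are hypothetical.

```python
import itertools
import random
import statistics


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch), keeping two DP rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Alignment score divided by the longer length (1.0 if identical).

    Assumes both sequences are non-empty.
    """
    return nw_score(a, b) / max(len(a), len(b))


def random_threshold(seqs, n_sigma=3.0, trials=200, seed=0):
    """Assumed threshold rule: mean + n_sigma * stdev of the similarity
    between shuffled copies of randomly drawn sequence pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        a, b = rng.sample(seqs, 2)
        a_shuf = "".join(rng.sample(list(a), len(a)))
        b_shuf = "".join(rng.sample(list(b), len(b)))
        scores.append(similarity(a_shuf, b_shuf))
    return statistics.mean(scores) + n_sigma * statistics.stdev(scores)


def cluster(seqs, threshold):
    """Single-linkage grouping: link every pair at or above the threshold.

    Returns the groups (lists of indices) and the pairwise similarities.
    """
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    sim = {}
    for i, j in itertools.combinations(range(len(seqs)), 2):
        sim[i, j] = sim[j, i] = similarity(seqs[i], seqs[j])
        if sim[i, j] >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values()), sim


def representative(group, sim):
    """Pick the member most similar, in total, to the rest of its group."""
    return max(group, key=lambda i: sum(sim[i, j] for j in group if j != i))


if __name__ == "__main__":
    seqs = ["ACDEFGHIKLMN", "ACDEFGHIKLMQ", "ACDEFGHIKAMQ", "WWWWPPPPYYYY"]
    groups, sim = cluster(seqs, threshold=random_threshold(seqs))
    for group in groups:
        print(group, "->", seqs[representative(group, sim)])
```

Single linkage is the simplest grouping rule to state, but it can chain distantly related sequences into one group; a stricter linkage or a different representative criterion (for example, best crystallographic resolution) would fit the same skeleton.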
