The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology, and Marine Science, University of the Ryukyus, Nishihara, Okinawa 903-0213, Japan.
J Chem Inf Model. 2010 Apr 26;50(4):690-700. doi: 10.1021/ci900452z.
Thorough analysis of secondary structures, the basic structural units of proteins, is of fundamental importance. Although recent computational methods predict secondary structures from amino acid sequences with reasonably high accuracy, the simple and fundamental empirical approach of characterizing the amino acid composition of secondary structures was carried out mainly in the 1970s, on a small number of structures. To extend this classical approach to a large data set, we characterized the amino acid sequences of secondary structures (12,154 alpha-helix units, 4592 3(10)-helix units, 16,787 beta-strand units, and 30,811 "other" units) using representative three-dimensional protein structure records (1641 protein chains) from the Protein Data Bank. We first examined the lengths and amino acid compositions of the secondary structures, including rank order differences and assignment relationships among amino acids. These compositional results were largely, but not entirely, consistent with previous studies. In addition, we examined the frequencies of the 400 amino acid doublets and 8000 triplets in secondary structures based on their relative counts, termed the availability. We identified not only triplets that were specific to a particular secondary structure but also so-called zero-count triplets, which did not occur at all in a given secondary structure even though they were probabilistically predicted to occur several times. Taken together, the present study reveals essential features of secondary structures and suggests potential applications in secondary structure prediction and in the functional design of protein sequences.
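The zero-count idea described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not give the exact definition of availability or the probabilistic model, so the sketch assumes that expected counts come from independent single-residue frequencies within one secondary-structure class. The function names, the `min_expected` threshold, and the toy input are all hypothetical.

```python
from collections import Counter
from itertools import product

# Assumption: sequences use only the 20 standard one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_kmers(segments, k):
    """Count every overlapping k-residue window across all segments."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - k + 1):
            counts[seg[i:i + k]] += 1
    return counts


def expected_kmer_counts(segments, k):
    """Expected count of each k-mer if residues were drawn independently.

    Assumed model (not taken from the paper): the probability of a k-mer
    is the product of its single-residue frequencies in these segments,
    multiplied by the number of k-residue windows available.
    """
    residue_counts = count_kmers(segments, 1)
    total_residues = sum(residue_counts.values())
    if total_residues == 0:
        return {}
    total_windows = sum(max(len(seg) - k + 1, 0) for seg in segments)
    freq = {aa: residue_counts[aa] / total_residues for aa in AMINO_ACIDS}
    expected = {}
    for combo in product(AMINO_ACIDS, repeat=k):
        p = 1.0
        for aa in combo:
            p *= freq[aa]
        expected["".join(combo)] = p * total_windows
    return expected


def zero_count_kmers(segments, k, min_expected=1.0):
    """k-mers never observed although expected at least min_expected times."""
    observed = count_kmers(segments, k)
    expected = expected_kmer_counts(segments, k)
    return sorted(
        kmer for kmer, e in expected.items()
        if observed[kmer] == 0 and e >= min_expected
    )


# Toy input standing in for the segments of one secondary-structure class.
toy_segments = ["ACAC", "CACA"]
print(zero_count_kmers(toy_segments, k=2))
```

With real data, `segments` would hold the sequences of all units of one class (for example, all alpha-helix units), and `k=3` would enumerate the 8000 triplets. The independence model is the simplest possible baseline; a different null model would change which triplets count as unexpectedly absent.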