Kondo Ryohei, Kasahara Kota, Takahashi Takuya
Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan.
College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan.
Biophys Physicobiol. 2022 Feb 8;19:1-12. doi: 10.2142/biophysico.bppb-v19.0002. eCollection 2022.
Elucidating the principles of sequence-structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, one type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on their environment. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure has not been fully quantified for each subsequence. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-grams (4 ≤ N ≤ 9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure, in terms of secondary structure and accessible surface area, are quantified as information quantities for each N-gram. Although the majority of observed subsequences have a low information quantity due to the lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus this technique helps to extract knowledge from the limited number of samples in the PDB.
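As an illustration of what an information quantity per subsequence might look like, the following is a minimal sketch and not the authors' method. It assumes the information quantity is the Kullback-Leibler divergence (in bits) between the secondary-structure distribution observed for a subsequence and the background distribution over all windows. The function name `ngram_information`, the choice of labelling each window by the state of its central residue, and the `min_count` filter are all hypothetical; the abstract does not specify these details.

```python
from collections import Counter, defaultdict
from math import log2


def ngram_information(sequences, structures, n, min_count=1):
    """Return {subsequence: KL divergence in bits} for windows of length n.

    Assumption (not from the paper): each window is labelled with the
    secondary-structure state of its central residue, and the information
    quantity is KL(P(state | subsequence) || P(state)).
    """
    background = Counter()            # state counts over all windows
    per_ngram = defaultdict(Counter)  # state counts per subsequence

    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for i in range(len(seq) - n + 1):
            state = ss[i + n // 2]    # state of the central residue
            background[state] += 1
            per_ngram[seq[i:i + n]][state] += 1

    total = sum(background.values())
    info = {}
    for ngram, counts in per_ngram.items():
        observed = sum(counts.values())
        if observed < min_count:
            continue                  # too few samples to estimate
        kl = 0.0
        for state, c in counts.items():
            p = c / observed              # P(state | subsequence)
            q = background[state] / total  # P(state)
            kl += p * log2(p / q)
        info[ngram] = kl
    return info


# Toy data: one sequence with helix (H) and strand (E) states.
info = ngram_information(["ACDEACDE"], ["HHHHEEEE"], n=4)
```

In this toy input, `ACDE` occurs once in a helix and once in a strand, so its value is close to zero, while `CDEA` occurs only in a helix and scores higher. A naive estimate like this overstates the tendency of subsequences seen only once or twice, which is why the sketch exposes `min_count`; how the paper handles small sample sizes is not stated in the abstract.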