Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank.

Author information

Kondo Ryohei, Kasahara Kota, Takahashi Takuya

Affiliations

Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan.

College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan.

Publication information

Biophys Physicobiol. 2022 Feb 8;19:1-12. doi: 10.2142/biophysico.bppb-v19.0002. eCollection 2022.

Abstract

Elucidating the principles of sequence-structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, a type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on its environments. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure for each subsequence has not been fully quantified. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-gram (4≤N≤9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure in terms of the secondary structure and accessible surface area are quantified as information quantities for each N-gram. Although the majority of observed subsequences have low information quantity due to lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus, this technique helps to extract knowledge from the limited number of samples in the PDB.
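
To make the idea of an information quantity per N-gram concrete, the following is a minimal sketch, not the authors' method. The abstract does not give the formula, so this sketch assumes the information quantity is the Kullback-Leibler divergence (in bits) between the secondary-structure patterns observed for one N-gram and the background distribution over all windows of the same length. The function name `ngram_information`, the one-letter structure codes (H, E, C), and the toy input are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_information(chains, n):
    """Return {ngram: (count, information_in_bits)}.

    chains: iterable of (sequence, structure) string pairs of equal length,
            e.g. ("AAAAK", "HHHHC"). The structure alphabet is an assumption.
    n:      window length (the abstract studies 4 <= N <= 9).

    ASSUMPTION: information is the KL divergence between the structure
    patterns seen for one N-gram and the background over all windows.
    """
    background = Counter()             # structure pattern -> count, all windows
    per_ngram = defaultdict(Counter)   # ngram -> structure pattern -> count

    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have equal length")
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            pattern = ss[i:i + n]
            background[pattern] += 1
            per_ngram[gram][pattern] += 1

    total = sum(background.values())
    result = {}
    for gram, patterns in per_ngram.items():
        count = sum(patterns.values())
        info = 0.0
        for pattern, k in patterns.items():
            p = k / count                  # P(pattern | ngram)
            q = background[pattern] / total  # P(pattern) over all windows
            info += p * math.log2(p / q)
        result[gram] = (count, info)
    return result


if __name__ == "__main__":
    # Toy data: "AAAA" appears once as helix and once as strand,
    # mimicking a chameleon-like subsequence.
    toy = [("AAAAK", "HHHHC"), ("AAAAG", "EEEEC")]
    for gram, (count, info) in sorted(ngram_information(toy, 4).items()):
        print(gram, count, round(info, 3))
```

In the toy input, "AAAA" is split between two structure patterns and scores lower than "AAAK", which is seen in only one. This plug-in estimate is biased upward for N-grams with few observations, so the count is returned alongside the score; a real analysis would need a sample-size correction, which the abstract does not specify.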

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/472a/8926306/68275d11f87a/19_e190002-g001.jpg