Mining protein loops using a structural alphabet and statistical exceptionality.

Affiliation

MTi, Inserm UMR-S 973, Université Paris Diderot - Paris 7, Paris, F-75205 Cedex 13, France.

Publication information

BMC Bioinformatics. 2010 Feb 4;11:75. doi: 10.1186/1471-2105-11-75.

DOI: 10.1186/1471-2105-11-75

PMID: 20132552

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2833150/

Abstract

BACKGROUND

Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein function, for example in binding sites and catalytic pockets. However, describing protein loops with conventional tools is difficult. Regular secondary structures (helices and strands) have been widely studied, whereas loops are hard to analyze because they are highly variable in sequence and structure. Because of data sparsity, long loops have rarely been studied systematically.

RESULTS

We developed a simple and accurate method for describing and analyzing the structures of short and long loops using structural motifs, without restriction on loop length. The method is based on the structural alphabet HMM-SA. HMM-SA simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment called a structural letter. The difficult task of structurally grouping huge data sets is thus reduced to handling structural-letter strings, as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments from a bank of 93000 protein loops and grouped them according to their structural-letter sequence, which we call a structural word. Because we consider structural motifs of seven residues rather than complete loops, this approach permits a systematic analysis of loops of all sizes. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that only 3310 highly recurrent structural words, out of 28274 observed words, cover 73% of loop lengths. These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference, but interestingly, two thirds are shared by short loops (fewer than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complemented our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and they cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequence specificity, suggesting structural or functional constraints.
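The word-counting step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each loop has already been encoded as a string of structural letters, and that consecutive four-residue letters overlap by three residues, so that a seven-residue fragment corresponds to 7 - 4 + 1 = 4 letters. The function names and the toy strings used to exercise them are hypothetical. The abstract does not specify the over-representation statistic, so the `z_scores` function swaps in a simple z-score under an independent-letter null model, which ignores the dependence between overlapping windows.

```python
from collections import Counter
from math import sqrt

# Assumption: a seven-residue fragment spans 7 - 4 + 1 = 4 overlapping letters.
WORD_LEN = 4


def count_words(loops, word_len=WORD_LEN):
    """Count every overlapping structural word across all encoded loops."""
    counts = Counter()
    for loop in loops:
        for i in range(len(loop) - word_len + 1):
            counts[loop[i:i + word_len]] += 1
    return counts


def recurrent_words(counts, min_count=30):
    """Keep the words observed more than `min_count` times."""
    return {word for word, n in counts.items() if n > min_count}


def coverage(loops, words, word_len=WORD_LEN):
    """Fraction of letter positions that lie inside at least one selected word."""
    covered = total = 0
    for loop in loops:
        mask = [False] * len(loop)
        for i in range(len(loop) - word_len + 1):
            if loop[i:i + word_len] in words:
                for j in range(i, i + word_len):
                    mask[j] = True
        covered += sum(mask)
        total += len(loop)
    return covered / total if total else 0.0


def z_scores(loops, counts, word_len=WORD_LEN):
    """Z-score of each word count under an independent-letter null model.

    Assumption: this simplified statistic stands in for whatever test the
    study actually used; it treats windows as independent binomial trials.
    """
    letters = Counter(ch for loop in loops for ch in loop)
    n_letters = sum(letters.values())
    n_windows = sum(max(len(loop) - word_len + 1, 0) for loop in loops)
    scores = {}
    for word, observed in counts.items():
        p = 1.0
        for ch in word:
            p *= letters[ch] / n_letters
        expected = n_windows * p
        variance = n_windows * p * (1.0 - p)
        scores[word] = (observed - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores
```

With real data, `recurrent_words(counts)` would use the default threshold of 30 occurrences stated in the abstract, and `coverage` would give a letter-level analogue of the reported share of loop lengths covered by recurrent words.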

CONCLUSIONS

We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA rather than on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has been used to increase the signal-to-noise ratio in protein loops. These findings help to better describe protein loops and may reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
