Estimating probabilistic context-free grammars for proteins using contact map constraints.

Authors

Dyrka Witold, Pyzik Mateusz, Coste François, Talibart Hugo

Affiliations

Wydział Podstawowych Problemów Techniki, Katedra Inżynierii Biomedycznej, Politechnika Wrocławska, Wrocław, Poland.

Univ Rennes, Inria, CNRS, IRISA, Rennes, France.

Publication information

PeerJ. 2019 Mar 18;7:e6559. doi: 10.7717/peerj.6559. eCollection 2019.

DOI:10.7717/peerj.6559

PMID:30918754

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6428041/

Abstract

Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions ought to be taken into account when modeling collections of proteins. Yet the most popular representations of sets of related protein sequences remain the profile Hidden Markov Models. By modeling independently the distributions of the conserved columns from an underlying multiple sequence alignment of the proteins, these models are unable to capture dependencies between the protein residues. Non-local interactions can be represented by using more expressive grammatical models. However, learning such grammars is difficult. In this work, we propose to use information on protein contacts to facilitate the training of probabilistic context-free grammars representing families of protein sequences. We develop the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs in comparison with learning without contact constraints. The evaluation shows high fidelity of grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that the current piece of research is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.
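
To make the idea of a grammar capturing non-local dependencies concrete, here is a minimal sketch of how a probabilistic context-free grammar assigns a probability to a sequence using the standard inside (CYK-style) algorithm. It is not the authors' software or their protein grammar: the two-letter alphabet, the rule set, and the names `BINARY`, `LEXICAL`, and `inside_probability` are hypothetical and chosen only for illustration. The toy grammar pairs each `a` with a matching `b` in a nested fashion, a dependency between distant positions that a column-independent model cannot express. Contact constraints are not modeled here.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical, for illustration only).
# It generates nested strings a^n b^n, so the first "a" depends on the last "b".
# Binary rules: (parent, left_child, right_child) -> probability
BINARY = {
    ("S", "A", "B"): 0.5,  # S -> A B  (innermost pair)
    ("S", "A", "T"): 0.5,  # S -> A T  (open an outer pair)
    ("T", "S", "B"): 1.0,  # T -> S B  (close the outer pair)
}
# Lexical rules: (nonterminal, terminal) -> probability
LEXICAL = {
    ("A", "a"): 1.0,
    ("B", "b"): 1.0,
}


def inside_probability(seq, start="S"):
    """Return P(seq | grammar), summed over all derivations (inside algorithm)."""
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[(i, j)][X] = probability that nonterminal X derives seq[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    # Spans of length 1 are filled from the lexical rules.
    for i, symbol in enumerate(seq):
        for (nonterminal, terminal), prob in LEXICAL.items():
            if terminal == symbol:
                chart[(i, i + 1)][nonterminal] += prob
    # Longer spans combine two adjacent sub-spans through a binary rule.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                left_cell = chart[(i, k)]
                right_cell = chart[(k, j)]
                if not left_cell or not right_cell:
                    continue
                for (parent, left, right), prob in BINARY.items():
                    p_left = left_cell.get(left, 0.0)
                    p_right = right_cell.get(right, 0.0)
                    if p_left and p_right:
                        chart[(i, j)][parent] += prob * p_left * p_right
    return chart[(0, n)].get(start, 0.0)


if __name__ == "__main__":
    for s in ["ab", "aabb", "abab"]:
        print(s, inside_probability(s))
```

In this toy grammar, `aabb` receives a non-zero probability because its symbols pair up in a nested way, while `abab` receives zero because no derivation exists. Estimating the rule probabilities from data, with or without structural constraints, is the harder problem the abstract refers to.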
