Estimating probabilistic context-free grammars for proteins using contact map constraints.

Author information

Dyrka Witold, Pyzik Mateusz, Coste François, Talibart Hugo

Affiliations

Wydział Podstawowych Problemów Techniki, Katedra Inżynierii Biomedycznej, Politechnika Wrocławska, Wrocław, Poland.

Univ Rennes, Inria, CNRS, IRISA, Rennes, France.

Publication information

PeerJ. 2019 Mar 18;7:e6559. doi: 10.7717/peerj.6559. eCollection 2019.

Abstract

Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions ought to be taken into account when modeling collections of proteins. Yet the most popular representations of sets of related protein sequences remain the profile Hidden Markov Models. By modeling independently the distributions of the conserved columns from an underlying multiple sequence alignment of the proteins, these models are unable to capture dependencies between the protein residues. Non-local interactions can be represented by using more expressive grammatical models. However, learning such grammars is difficult. In this work, we propose to use information on protein contacts to facilitate the training of probabilistic context-free grammars representing families of protein sequences. We develop the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs in comparison with learning without contact constraints. The evaluation shows high fidelity of grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that the current piece of research is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.
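To make the idea of contact constraints concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical one-nonterminal toy grammar over the alphabet {a, b} with three rule types (S -> x, S -> x S, and a pairing rule S -> x S y), and it assumes a simplified constraint: a pairing rule may span positions (i, j) only if that pair is listed in the contact map. All names (P_END, P_LEFT, P_PAIR, sequence_probability) and all probabilities are illustrative. The function sums the probability of every derivation of a sequence with an inside-style recursion, optionally restricted to derivations consistent with the contacts.

```python
from functools import lru_cache

# Hypothetical toy grammar with a single nonterminal S (illustrative values only).
P_END = {"a": 0.1, "b": 0.1}    # S -> x
P_LEFT = {"a": 0.2, "b": 0.2}   # S -> x S
P_PAIR = {("a", "b"): 0.4}      # S -> x S y  (x and y form a non-local pair)


def sequence_probability(seq, contacts=None):
    """Sum the probabilities of all derivations of seq from S.

    contacts: optional set of (i, j) index pairs with i < j. When given,
    the pairing rule may only be applied across a listed pair (an assumed,
    simplified form of a contact constraint). None means no constraint.
    """
    if not seq:
        return 0.0

    @lru_cache(maxsize=None)
    def inside(i, j):
        # Probability that S derives seq[i..j], both ends inclusive.
        if i == j:
            return P_END.get(seq[i], 0.0)
        # Local rule: emit seq[i], then derive the rest.
        total = P_LEFT.get(seq[i], 0.0) * inside(i + 1, j)
        # Pairing rule: emit seq[i] and seq[j] together around an inner span.
        pair_allowed = contacts is None or (i, j) in contacts
        if j - i >= 2 and pair_allowed:
            total += P_PAIR.get((seq[i], seq[j]), 0.0) * inside(i + 1, j - 1)
        return total

    return inside(0, len(seq) - 1)


if __name__ == "__main__":
    print(sequence_probability("aab"))                     # all derivations
    print(sequence_probability("aab", contacts={(0, 2)}))  # pair (0, 2) allowed
    print(sequence_probability("aab", contacts=set()))     # no pairing allowed
```

Under these assumptions, restricting the sum to contact-consistent derivations can only lower or preserve the probability mass. A constrained sum of this kind is the sort of quantity that a maximum-likelihood or contrastive estimation scheme would work with.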

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bfd2/6428041/4907278942fc/peerj-07-6559-g001.jpg
