Evaluation of a sophisticated SCFG design for RNA secondary structure prediction.

Nebel Markus E, Scheid Anika

Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany.

Theory Biosci. 2011 Dec;130(4):313-36. doi: 10.1007/s12064-011-0139-7. Epub 2011 Dec 2.

Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Over the past decades, mainly two different approaches have been considered to compute predictions of RNA secondary structures from a single sequence: the first one relies on physics-based and the other on probabilistic RNA models. Particularly, the free energy minimization (MFE) approach is usually considered the most popular and successful method. Moreover, based on the paradigm-shifting work by McCaskill which proposes the computation of partition functions (PFs) and base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods and clustering techniques have been invented over the last years. However, the accuracy of the corresponding algorithms is limited by the quality of underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding generalizations, like conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to learn about the structural behavior of the molecules by learning (a manageable number of) probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG actually serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence that represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some new ways to derive meaningful structure predictions from generated sample sets are presented. 
They are used to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. In particular, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield higher accuracy but eventually require more comprehensive and pure training sets. Investigations of both the accuracy of predicted foldings and the overall quality of generated sample sets (especially at the abstraction level of abstract shapes of generated structures, which is relevant for biologists) lead to the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies a greater structural diversity within the samples generated by the SCFG model. In general, neither of the two distinct ensemble distributions is more adequate than the other, and the corresponding results obtained by statistical sampling can be expected to bear fundamental differences, such that the method to be preferred for a particular input sequence strongly depends on the considered RNA type.
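To make the idea of deriving a prediction from a generated sample set more concrete, the following Python sketch shows three generic summaries of a set of sampled structures in dot-bracket notation: the most frequently sampled structure, a centroid-style consensus that keeps every base pair found in more than half of the samples, and a coarse abstract shape for each structure. It is an illustration under stated assumptions, not the authors' procedure: the function names, the toy sample set, and the simplified shape abstraction (unpaired bases ignored, nested helices merged) are all hypothetical choices made here.

```python
from collections import Counter


def base_pairs(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def most_frequent(samples):
    """Return the structure that was sampled most often."""
    return Counter(samples).most_common(1)[0][0]


def centroid(samples):
    """Keep every base pair that occurs in more than half of the samples.

    Two conflicting pairs can never both exceed a frequency of 0.5,
    so the result is always a valid nested structure.
    """
    counts = Counter(pair for s in samples for pair in base_pairs(s))
    result = ["."] * len(samples[0])
    for (i, j), count in counts.items():
        if 2 * count > len(samples):
            result[i], result[j] = "(", ")"
    return "".join(result)


def _parse(brackets, pos):
    """Parse consecutive helices starting at pos into nested lists."""
    children = []
    while pos < len(brackets) and brackets[pos] == "(":
        inner, pos = _parse(brackets, pos + 1)
        pos += 1  # skip the matching ")"
        children.append(inner)
    return children, pos


def _render(children):
    """Render nested lists as a shape string, merging single-child chains."""
    out = ""
    for child in children:
        while len(child) == 1:  # a pair enclosing exactly one pair: same stem
            child = child[0]
        out += "[" + _render(child) + "]"
    return out


def abstract_shape(structure):
    """Coarse shape: drop unpaired bases and merge nested helices (assumption)."""
    tree, _ = _parse(structure.replace(".", ""), 0)
    return _render(tree)


# Toy sample set (hypothetical), all structures of length 12.
samples = [
    "((...)).....",
    "((...)).....",
    "(.....).....",
    "((...))(...)",
    ".((...))....",
]

print(most_frequent(samples))                      # ((...)).....
print(centroid(samples))                           # ((...)).....
print(Counter(abstract_shape(s) for s in samples))  # shape frequencies
```

Counting shapes rather than individual structures gives a rough picture of how concentrated or diverse a sample set is: a sample dominated by a single shape corresponds to a more centered ensemble distribution, while many competing shapes indicate greater structural diversity.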

Similar articles

Evaluating the effect of disturbed ensemble distributions on SCFG based statistical sampling of RNA secondary structures.

BMC Bioinformatics. 2012 Jul 9;13:159. doi: 10.1186/1471-2105-13-159.

CONTRAfold: RNA secondary structure prediction without physics-based models.

Bioinformatics. 2006 Jul 15;22(14):e90-8. doi: 10.1093/bioinformatics/btl246.

Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction.

BMC Bioinformatics. 2004 Jun 4;5:71. doi: 10.1186/1471-2105-5-71.

Statistical and Bayesian approaches to RNA secondary structure prediction.

RNA. 2006 Mar;12(3):323-31. doi: 10.1261/rna.2274106.

Analysis of energy-based algorithms for RNA secondary structure prediction.

BMC Bioinformatics. 2012 Feb 1;13:22. doi: 10.1186/1471-2105-13-22.

SCFGs in RNA secondary structure prediction: a hands-on approach.

Methods Mol Biol. 2014;1097:143-62. doi: 10.1007/978-1-62703-709-9_8.

A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more.

RNA. 2012 Feb;18(2):193-212. doi: 10.1261/rna.030049.111. Epub 2011 Dec 22.

Stochastic modeling of RNA pseudoknotted structures: a grammatical approach.

Bioinformatics. 2003;19 Suppl 1:i66-73. doi: 10.1093/bioinformatics/btg1007.

RNA Secondary Structure Thermodynamics.

Methods Mol Biol. 2024;2726:45-83. doi: 10.1007/978-1-0716-3519-3_3.

Cited by

Markov Chain-Based Sampling for Exploring RNA Secondary Structure under the Nearest Neighbor Thermodynamic Model and Extended Applications.

Math Comput Appl. 2020 Dec;25(4). doi: 10.3390/mca25040067. Epub 2020 Oct 10.

RNA folding with hard and soft constraints.

Algorithms Mol Biol. 2016 Apr 23;11:8. doi: 10.1186/s13015-016-0070-z. eCollection 2016.

The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective.

RNA Biol. 2013 Jul;10(7):1185-96. doi: 10.4161/rna.24971. Epub 2013 May 10.

Evaluating the effect of disturbed ensemble distributions on SCFG based statistical sampling of RNA secondary structures.

BMC Bioinformatics. 2012 Jul 9;13:159. doi: 10.1186/1471-2105-13-159.

References

A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more.

RNA. 2012 Feb;18(2):193-212. doi: 10.1261/rna.030049.111. Epub 2011 Dec 22.

Random generation of RNA secondary structures according to native distributions.

Algorithms Mol Biol. 2011 Oct 12;6:24. doi: 10.1186/1748-7188-6-24.

Semantics and ambiguity of stochastic RNA family models.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Mar-Apr;8(2):499-516. doi: 10.1109/TCBB.2010.12.

On quantitative effects of RNA shape abstraction.

Theory Biosci. 2009 Nov;128(4):211-25. doi: 10.1007/s12064-009-0074-z. Epub 2009 Sep 15.

Prediction of RNA secondary structure using generalized centroid estimators.

Bioinformatics. 2009 Feb 15;25(4):465-73. doi: 10.1093/bioinformatics/btn601. Epub 2008 Dec 18.

Shape based indexing for faster search of RNA family databases.

BMC Bioinformatics. 2008 Feb 29;9:131. doi: 10.1186/1471-2105-9-131.

CONTRAfold: RNA secondary structure prediction without physics-based models.

Bioinformatics. 2006 Jul 15;22(14):e90-8. doi: 10.1093/bioinformatics/btl246.

Statistical and Bayesian approaches to RNA secondary structure prediction.

RNA. 2006 Mar;12(3):323-31. doi: 10.1261/rna.2274106.

RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble.

RNA. 2005 Aug;11(8):1157-66. doi: 10.1261/rna.2500605.

Rfam: annotating non-coding RNAs in complete genomes.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D121-4. doi: 10.1093/nar/gki081.
