Evaluation of a sophisticated SCFG design for RNA secondary structure prediction.

Authors

Nebel Markus E, Scheid Anika

Affiliation

Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany.

Publication information

Theory Biosci. 2011 Dec;130(4):313-36. doi: 10.1007/s12064-011-0139-7. Epub 2011 Dec 2.

Abstract

Predicting the secondary structures of RNA molecules is one of the fundamental problems in computational structural biology, and a challenging one. Over the past decades, two main approaches have been considered for predicting RNA secondary structures from a single sequence: the first relies on physics-based RNA models, the second on probabilistic ones. In particular, the minimum free energy (MFE) approach is usually considered the most popular and successful method. Moreover, building on the paradigm-shifting work of McCaskill, which proposes computing partition functions (PFs) and base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods, and clustering techniques have been developed in recent years. However, the accuracy of these algorithms is limited by the quality of the underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding generalizations, such as conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to capture the structural behavior of the molecules by learning a manageable number of probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence, which represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some new ways to derive meaningful structure predictions from generated sample sets are presented. They are used to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. In particular, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield higher accuracy but ultimately require more comprehensive and purer training sets. Investigations of both the accuracy of predicted foldings and the overall quality of generated sample sets (especially at the level of abstract shapes of the generated structures, an abstraction that is relevant to biologists) lead to the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies greater structural diversity within the samples generated by the latter. In general, neither of the two distinct ensemble distributions is more adequate than the other, and the results obtained by statistical sampling can be expected to differ fundamentally, so that the method to be preferred for a particular input sequence strongly depends on the RNA type considered.
