Department of Computer Science, University of Kaiserslautern, D-67653 Kaiserslautern, P.O. Box 3049, Germany.
BMC Bioinformatics. 2012 Jul 9;13:159. doi: 10.1186/1471-2105-13-159.
Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other.
In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst-case - without sacrificing much of the accuracy of the results.
Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms.
在过去的几年中,统计和贝叶斯方法已经越来越多地被用于解决长期存在的计算 RNA 结构预测问题。最近,研究了一种从单个序列预测 RNA 二级结构的新的概率方法,该方法基于为特定输入序列的整个可行结构集合生成具有统计代表性和可重现性的样本。该方法从一个复杂的(传统或长度依赖的)随机上下文无关语法(SCFG)所暗示的分布中对可能的折叠进行采样,该语法反映了现代基于物理的预测算法中应用的标准热力学模型。具体来说,该语法表示 Sfold 软件所基于的能量模型的精确概率对应物,该软件采用分配函数(PF)方法的扩展来生成统计上有代表性的 Boltzmann 加权集合的子集。尽管这两种采样方法的最坏情况时间和空间复杂度相同,但已经表明它们在性能上有所不同(在预测准确性和生成样本的质量方面),这两种竞争方法都没有一种普遍优于另一种。
在这项工作中,我们将考虑基于 SCFG 的方法,以分析当将不同程度的干扰纳入所需的采样概率时,生成的样本集的质量和相应的预测准确性如何变化。这是因为如果结果证明对不同的采样概率(与精确概率相比)具有较大的误差具有抗性,那么这表明这些概率不需要精确计算,而是可以近似计算。因此,在不显着降低准确性的情况下,可能可以降低基于此类 SCFG 的采样方法的最坏情况时间要求。另一方面,如果可以观察到采样结构的质量对轻微干扰强烈反应,则通过启发式程序来改善复杂性的希望很小。因此,我们提供了一个可靠的测试,以验证以下假设:可以实施启发式方法来改善最坏情况下的 RNA 二级结构预测的时间缩放,而不会牺牲结果的准确性。
我们的实验表明,绝对误差通常会导致生成无用的样本集,而相对误差似乎对预测准确性和生成结构样本的整体质量都只有很小的负面影响。基于这些观察结果,我们提出了一些有用的想法,以开发一种可保证可接受的预测准确性的时间减少的采样方法。我们还讨论了在近似情况下出现的一些固有缺点。本文的关键结果对于设计基于越来越受欢迎和有吸引力的统计采样方法的高效竞争启发式预测方法至关重要。这确实是通过构建原型算法来指示的。