Nebel Markus E, Scheid Anika, Weinberg Frank
Department of Computer Science, University of Kaiserslautern, Germany.
Algorithms Mol Biol. 2011 Oct 12;6:24. doi: 10.1186/1748-7188-6-24.
Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest.
In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time O(n2) for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity Om⋅n⋅ log(n) while other algorithms typically have a runtime in O(m⋅n2). Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities.
A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.
随机生物序列是基因组分析中一个备受关注的主题,因为根据一个强大的范式,它们代表了实际生物信息必须从中区分出来的背景噪声。因此,随机序列的生成已经被研究了很长时间。同样,像RNA分子或蛋白质这样具有更复杂结构的随机对象也备受关注。
在本文中,我们提出了一个新的通用框架,用于根据随机上下文无关文法所隐含的编码和概率分布,推导组合对象非均匀随机生成的算法。简而言之,该框架扩展了用于(均匀)随机生成的著名递归方法,并使用了组合类可允许规范的流行框架,引入加权组合类以允许通过逆序排序进行非均匀生成。这个框架被用于推导一个生成给定固定大小RNA二级结构的算法。我们通过使用一个非常详细的上下文无关文法(通过区分RNA结构中的所有已知基序来对RNA二级结构类进行建模),根据从实际数据获得的现实分布来处理这些结构的随机生成。与几种结构预测工具(如SFold)中使用的著名采样方法相比,我们的方法有两个主要优点:首先,在进行一次时间复杂度为O(n²)的预处理步骤以计算所需的所有加权类大小之后,使用我们的方法,在最坏情况下,一组大小为n的m个随机二级结构可以在时间复杂度O(m·n·log(n))内计算出来,而其他算法通常具有O(m·n²)的运行时间。其次,我们的方法仅使用整数运算,这更快,并且使我们避免了使用带有对数概率的浮点运算所带来的所有麻烦细节。
大量实验结果表明,我们的随机生成方法至少在不同结构基序的出现方面产生了现实的输出。该算法可作为网络服务在http://wwwagak.cs.uni-kl.de/NonUniRandGen上获取,可用于生成任何指定RNA类型的随机二级结构。在那里也可以找到下载我们方法实现(用Wolfram Mathematica)的链接。