Xu Xing, Ji Yongmei, Stormo Gary D
Department of Genetics, Washington University, School of Medicine, St. Louis, MO 63110, USA.
Bioinformatics. 2007 Aug 1;23(15):1883-91. doi: 10.1093/bioinformatics/btm272. Epub 2007 May 30.
Non-coding RNA genes and RNA structural regulatory motifs play important roles in gene regulation and other cellular functions. They are often characterized by specific secondary structures that are critical to their functions and are often conserved in phylogenetically or functionally related sequences. Predicting common RNA secondary structures in multiple unaligned sequences remains a challenge in bioinformatics research.
We present a new sampling-based algorithm to predict common RNA secondary structures in multiple unaligned sequences. Our algorithm finds the common structure between two sequences by probabilistically sampling aligned stems based on stem conservation, which is calculated from intrasequence base-pairing probabilities and intersequence base alignment probabilities. It iteratively updates these probabilities based on the sampled structures and then recalculates stem conservation using the updated probabilities. The iterative process terminates when the sampled structures converge. We extend the algorithm to multiple sequences with a consistency-based method, which iteratively incorporates and reinforces consistent structure information from pairwise comparisons into consensus structures. The algorithm places no restriction on predicting pseudoknots. In extensive testing on real sequence data, our algorithm outperformed other leading RNA structure prediction methods in both sensitivity and specificity while running reasonably fast. Across sequences with a wide range of identities, it also generated better structural alignments than other programs, alignments that more accurately represent the conservation of RNA secondary structure.
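The sample-then-update loop described above can be illustrated with a toy sketch in Python. It is not the RNA Sampler implementation: the abstract gives no scoring formulas, so the stem weights below are hypothetical stand-ins for stem conservation, and the multiplicative update rule, the convergence tolerance, and the names `sample_structure` and `iterate_sampling` are all assumptions made for illustration. Stems are rejected only when they share a base, so crossing (pseudoknotted) stems remain allowed.

```python
import random
from collections import Counter


def sample_structure(stems, weights, rng):
    """Draw one set of mutually compatible stems.

    stems   : dict name -> non-empty frozenset of base positions the stem uses
    weights : dict name -> positive score (hypothetical stand-in for
              stem conservation; the real score is not given in the abstract)
    """
    remaining = set(stems)
    chosen = []
    while remaining:
        names = sorted(remaining)
        pick = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        chosen.append(pick)
        # Remove the picked stem and every stem that shares a base with it.
        # Crossing stems are NOT removed, so pseudoknots stay possible.
        remaining = {n for n in remaining if not (stems[n] & stems[pick])}
    return frozenset(chosen)


def iterate_sampling(stems, weights, n_samples=200, max_rounds=20,
                     tol=0.01, seed=0):
    """Repeat sampling and re-weighting until the weights stop changing.

    The update (old weight times sampled frequency, renormalized) is an
    assumed reinforcement rule, not the rule used by RNA Sampler.
    """
    rng = random.Random(seed)
    total = sum(weights.values())
    current = {n: w / total for n, w in weights.items()}
    freqs = {n: 0.0 for n in stems}
    for _ in range(max_rounds):
        counts = Counter()
        for _ in range(n_samples):
            counts.update(sample_structure(stems, current, rng))
        freqs = {n: counts[n] / n_samples for n in stems}
        # Small floor keeps every weight strictly positive.
        raw = {n: current[n] * (freqs[n] + 1e-6) for n in stems}
        total = sum(raw.values())
        updated = {n: v / total for n, v in raw.items()}
        change = max(abs(updated[n] - current[n]) for n in stems)
        current = updated
        if change < tol:
            break
    # Report stems present in at least half of the last round's samples.
    return frozenset(n for n, f in freqs.items() if f >= 0.5)


if __name__ == "__main__":
    # Toy input: A and B compete for shared bases; C is independent.
    toy_stems = {
        "A": frozenset({1, 2, 3, 10, 11, 12}),
        "B": frozenset({2, 3, 4, 8, 9, 10}),
        "C": frozenset({20, 21, 30, 31}),
    }
    toy_weights = {"A": 5.0, "B": 1.0, "C": 3.0}
    print(sorted(iterate_sampling(toy_stems, toy_weights)))
```

In this toy input, stems A and B overlap, so at most one of them can appear in any sampled structure, while C is compatible with both. Repeated re-weighting is expected to reinforce the higher-scoring of the two competing stems.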
The algorithm is implemented in a C program, RNA Sampler, which is available at http://ural.wustl.edu/software.html