Anwar Mohammad, Nguyen Truong, Turcotte Marcel
School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada.
BMC Bioinformatics. 2006 May 5;7:244. doi: 10.1186/1471-2105-7-244.
The identification of a consensus RNA motif often consists of finding a conserved secondary structure with minimum free energy in an ensemble of aligned sequences. However, an alignment is often difficult to obtain without prior structural information. Hence the need for tools that automate this process.
We present an algorithm called Seed to identify all the conserved RNA secondary structure motifs in a set of unaligned sequences. The search space is defined as the set of all the secondary structure motifs inducible from a seed sequence. A general-to-specific search finds all the motifs that are conserved. Suffix arrays are used both to enumerate efficiently all the biological palindromes and to match RNA secondary structure expressions. We assessed the ability of this approach to uncover known structures using four datasets. The enumeration of the motifs relies only on the secondary structure definition and on conservation, which allows scoring schemes to be evaluated independently. Twelve simple objective functions based on free energy were evaluated for their potential to discriminate native folds from the rest.
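As an illustration of how a suffix array can help enumerate biological palindromes (a 5' arm whose reverse complement occurs downstream as the 3' arm), the following minimal sketch may be useful. It is not the Seed implementation: the function names, the naive suffix-array construction, the fixed arm length, the `min_loop` parameter, and the restriction to Watson-Crick pairs (no G-U wobble) are all assumptions made for brevity.

```python
# Hypothetical sketch: enumerate candidate stems with a (naively built) suffix array.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the Watson-Crick reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the suffix array for all start positions of pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Suffixes sharing the prefix `pattern` are contiguous in the array.
    while lo < len(suffix_array) and text[suffix_array[lo]:suffix_array[lo] + m] == pattern:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def enumerate_stems(seq, stem_len, min_loop=3):
    """List (i, j) pairs: a 5' arm starts at i, its 3' arm starts at j.

    The two arms must be separated by at least min_loop unpaired bases.
    """
    suffix_array = build_suffix_array(seq)
    stems = []
    for i in range(len(seq) - stem_len + 1):
        five_prime_arm = seq[i:i + stem_len]
        three_prime_arm = reverse_complement(five_prime_arm)
        for j in find_occurrences(seq, suffix_array, three_prime_arm):
            if j >= i + stem_len + min_loop:
                stems.append((i, j))
    return stems


if __name__ == "__main__":
    print(enumerate_stems("GGGAAAUCCC", stem_len=3))
```

For the toy hairpin `GGGAAAUCCC`, the only stem of length 3 reported is the `GGG`/`CCC` pair. A production tool would build the suffix array in linear or near-linear time and handle variable arm lengths and wobble pairs.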
Our evaluation shows that 1) support and exclusion constraints are sufficient to make an exhaustive search of the secondary structure space feasible; 2) the search space induced from a seed sequence contains known motifs; and 3) simple objective functions, consisting of a combination of the free energy of matching sequences, can generally identify motifs with high positive predictive value and sensitivity to known motifs.
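Positive predictive value and sensitivity are commonly computed over base pairs: the fraction of predicted pairs that are in the known structure, and the fraction of known pairs that are recovered. The sketch below assumes this base-pair-level definition and dot-bracket input; both are assumptions for illustration, and the helper names are hypothetical rather than taken from the paper.

```python
# Hypothetical sketch: base-pair-level PPV and sensitivity from dot-bracket strings.
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for position, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            pairs.add((stack.pop(), position))
    return pairs


def ppv_and_sensitivity(predicted, known):
    """Compare two dot-bracket structures of the same sequence."""
    predicted_pairs = base_pairs(predicted)
    known_pairs = base_pairs(known)
    true_positives = len(predicted_pairs & known_pairs)
    # Guard against empty structures to avoid division by zero.
    ppv = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    sensitivity = true_positives / len(known_pairs) if known_pairs else 0.0
    return ppv, sensitivity


if __name__ == "__main__":
    print(ppv_and_sensitivity("((.....))", "(((...)))"))
```

Here the prediction recovers two of the three known pairs and predicts no spurious ones, so its positive predictive value is 1.0 and its sensitivity is 2/3.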