Okada Yohei, Sato Kengo, Sakakibara Yasubumi
Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan.
Pac Symp Biocomput. 2010:88-97. doi: 10.1142/9789814295291_0011.
RNAz, a support vector machine (SVM) approach for identifying functional non-coding RNAs (ncRNAs), has proven to be one of the most accurate tools for this task. Among the measurements used in RNAz, the Structure Conservation Index (SCI), which evaluates the evolutionary conservation of RNA secondary structures in terms of folding energies, has been reported to have extremely high discrimination capability. In practical genome-wide searches with RNAz, however, a relatively high false discovery rate has been estimated. One plausible cause is that multiple alignments produced by a standard aligner that ignores secondary structure are in some cases unsuitable for identifying ncRNAs and therefore incur a high false discovery rate. In this study, we propose C-SCI, an improved measurement based on the SCI that applies gamma-centroid estimators to gain robustness against low-quality multiple alignments. Our experiments show that the C-SCI achieves higher accuracy than the original SCI not only on human-curated structural alignments but also on low-quality alignments produced by CLUSTAL W. Furthermore, the accuracy of the C-SCI on CLUSTAL W alignments is comparable to that of the original SCI on structural alignments generated with RAF, which requires 4.7 times as much computation time on average.
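For readers unfamiliar with the original SCI, the following is a minimal sketch of how it is commonly defined: the consensus minimum free energy (MFE) of the alignment divided by the mean MFE of the individual sequences. The function name, the zero-energy convention, and the energy values are illustrative assumptions, not taken from the paper. In practice the energies would come from folding tools such as RNAalifold and RNAfold. The sketch covers only the original SCI; the abstract does not give the C-SCI formula, so it is not reproduced here.

```python
from statistics import mean


def structure_conservation_index(consensus_mfe, single_mfes):
    """Original SCI: consensus MFE of the alignment over the mean single-sequence MFE.

    Energies are in kcal/mol and are normally negative. The name and
    signature are hypothetical, chosen for this sketch only.
    """
    if not single_mfes:
        raise ValueError("at least one single-sequence MFE is required")
    average_mfe = mean(single_mfes)
    if average_mfe == 0:
        # Convention assumed for this sketch: no stable single-sequence
        # structure means there is nothing to conserve.
        return 0.0
    return consensus_mfe / average_mfe


# Made-up energies, for illustration only.
single_sequence_mfes = [-25.0, -30.0, -20.0]  # mean is -25.0
alignment_consensus_mfe = -20.0
sci = structure_conservation_index(alignment_consensus_mfe, single_sequence_mfes)
print(f"SCI = {sci:.2f}")
```

An SCI near 0 suggests that the sequences share no common fold, a value near 1 suggests a well-conserved structure, and values above 1 can occur when compensatory mutations add support to the consensus structure. Because the consensus energy depends on how the columns are aligned, a poor alignment tends to lower the score, which is the weakness the C-SCI is meant to address.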