Zhang Boyu, Yehdego Daniel T, Johnson Kyle L, Leung Ming-Ying, Taufer Michela
BMC Struct Biol. 2013;13 Suppl 1(Suppl 1):S3. doi: 10.1186/1472-6807-13-S1-S3. Epub 2013 Nov 8.
Ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Their secondary structures are crucial to RNA function, and the prediction of these structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting the secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure from the whole RNA sequence. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three chunking methods combined with seven popular secondary structure prediction programs. We apply them to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, and to a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment.
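The chunk, predict, and reconstruct steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed-length, non-overlapping chunking, the function names (`chunk_fixed`, `predict_chunked`, `toy_predict`), and the toy predictor are all assumptions made for illustration. A real run would call one of the thermodynamic prediction programs in place of `toy_predict`, and would use one of the chunking methods studied in the paper. In MapReduce terms, the per-chunk prediction corresponds to the map step and the merging of shifted base pairs to the reduce step.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]  # (i, j) indices of two paired bases, with i < j


def chunk_fixed(seq: str, max_len: int) -> List[Tuple[int, str]]:
    """Split seq into consecutive, non-overlapping chunks of at most max_len bases.

    Assumption: fixed-length chunking is used here only for illustration.
    Returns (start_offset, chunk_sequence) tuples.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [(start, seq[start:start + max_len])
            for start in range(0, len(seq), max_len)]


def toy_predict(chunk: str) -> List[Pair]:
    """Hypothetical stand-in for a real predictor.

    Pairs the first and last base of the chunk if they are Watson-Crick
    complementary and at least 3 bases lie between them.
    """
    complementary = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    if len(chunk) >= 5 and (chunk[0], chunk[-1]) in complementary:
        return [(0, len(chunk) - 1)]
    return []


def predict_chunked(seq: str, max_len: int,
                    predict: Callable[[str], List[Pair]]) -> List[Pair]:
    """Predict each chunk independently, then rebuild whole-sequence pairs."""
    pairs: List[Pair] = []
    for start, chunk in chunk_fixed(seq, max_len):   # "map": one prediction per chunk
        for i, j in predict(chunk):
            pairs.append((start + i, start + j))     # shift to whole-sequence coordinates
    return sorted(pairs)                             # "reduce": merge chunk results


if __name__ == "__main__":
    print(predict_chunked("GAAAACGAAAAC", 6, toy_predict))
```

Because each chunk is predicted independently, base pairs that would span a chunk boundary cannot be recovered in this sketch, which is one reason the choice of chunking method matters.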
On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted with chunking is more similar to the real structure than the one predicted from the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family shows that the coarse-grained mapping of chunking and predictions in the MapReduce framework yields shorter turnaround times for short RNA sequences. However, as the lengths of the RNA sequences increase, the fine-grained mapping can outperform the coarse-grained mapping.
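To make the "larger than one" reading concrete, the following sketch computes an accuracy retention value under two assumptions that the abstract does not spell out: that accuracy retention is the ratio of chunk-based accuracy to whole-sequence accuracy, and that accuracy is measured as the F1 score over base pairs against the known structure. The function names `pair_f1` and `accuracy_retention` are hypothetical, and the example base pairs are made up for illustration.

```python
from typing import Optional, Set, Tuple

Pair = Tuple[int, int]


def pair_f1(predicted: Set[Pair], known: Set[Pair]) -> float:
    """F1 score of predicted base pairs against the known structure.

    Assumption: the accuracy metric actually used in the paper may differ.
    """
    if not predicted and not known:
        return 0.0
    true_positives = len(predicted & known)
    return 2.0 * true_positives / (len(predicted) + len(known))


def accuracy_retention(chunked: Set[Pair], whole: Set[Pair],
                       known: Set[Pair]) -> Optional[float]:
    """Ratio of chunk-based accuracy to whole-sequence accuracy.

    Values above 1 mean the chunk-based prediction is closer to the known
    structure. Returns None when the whole-sequence accuracy is zero, since
    the ratio is then undefined.
    """
    whole_accuracy = pair_f1(whole, known)
    if whole_accuracy == 0.0:
        return None
    return pair_f1(chunked, known) / whole_accuracy


if __name__ == "__main__":
    known = {(0, 9), (1, 8), (2, 7)}
    whole_prediction = {(0, 9), (1, 8)}            # misses one known pair
    chunked_prediction = {(0, 9), (1, 8), (2, 7)}  # recovers all known pairs
    print(accuracy_retention(chunked_prediction, whole_prediction, known))
```

In this made-up example the whole-sequence prediction scores 0.8 and the chunk-based prediction scores 1.0, so the retention value is above one.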
By using our MapReduce framework together with statistical analysis of the accuracy retention results, we observe that the inversion-based chunking methods can outperform predictions that use the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.