Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad153.
Since the 1980s, dozens of computational methods have addressed the problem of predicting RNA secondary structure. These include methods that follow standard optimization approaches and, more recently, machine learning (ML) algorithms. The former have been repeatedly benchmarked on various datasets. The latter, on the other hand, have not yet undergone extensive analysis that could suggest to the user which algorithm best fits the problem to be solved. In this review, we compare 15 methods that predict the secondary structure of RNA, of which 6 are based on deep learning (DL), 3 are based on shallow learning (SL) and 6 are control methods based on non-ML approaches. We discuss the ML strategies implemented and perform three experiments in which we evaluate the prediction of (I) representatives of the RNA equivalence classes, (II) selected Rfam sequences and (III) RNAs from new Rfam families. We show that DL-based algorithms (such as SPOT-RNA and UFold) can outperform SL and traditional methods if the data distribution is similar in the training and testing sets. However, when predicting 2D structures for new RNA families, the advantage of DL is no longer clear, and its performance is inferior or equal to that of SL and non-ML methods.