Akiyama Manato, Sato Kengo, Sakakibara Yasubumi
Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
J Bioinform Comput Biol. 2018 Dec;16(6):1840025. doi: 10.1142/S0219720018400255.
A popular approach to predicting RNA secondary structure is the thermodynamic nearest-neighbor model, which finds the thermodynamically most stable secondary structure, i.e., the one with minimum free energy (MFE). As an alternative aimed at further improvement, approaches based on machine learning have been developed. A machine learning-based approach can employ a fine-grained model with much richer feature representations and a greater capacity to fit the training data. Although machine learning-based fine-grained models have achieved extremely high prediction accuracy, a risk of overfitting has been reported for such models. In this paper, we propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning-based weighted approach. Our fine-grained model combines experimentally determined thermodynamic parameters with a large number of scoring parameters for detailed feature contexts, which are trained by a structured support vector machine (SSVM) with regularization to avoid overfitting. Our benchmark shows that our algorithm achieves the best prediction accuracy among the existing methods compared, and no heavy overfitting is observed. The implementation of our algorithm is available at https://github.com/keio-bioinformatics/mxfold.
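To make the idea of integrating a fixed thermodynamic term with learned scoring parameters more concrete, the following is a minimal toy sketch, not the mxfold implementation. It assumes a structure is a set of base-pair index tuples, uses made-up per-pair "thermodynamic" scores in place of real nearest-neighbor parameters, replaces dynamic-programming inference with a search over an explicit candidate list, and uses L2 shrinkage as one possible regularizer. All names (`THERMO`, `features`, `score`, `ssvm_step`) are hypothetical.

```python
from collections import Counter

# Placeholder "thermodynamic" pair scores (assumed values, NOT real
# nearest-neighbor parameters). These stay fixed during training.
THERMO = {"GC": 3.0, "CG": 3.0, "AU": 2.0, "UA": 2.0, "GU": 1.0, "UG": 1.0}

def features(seq, pairs):
    """Count base-pair types in a structure given as a set of (i, j) pairs."""
    return Counter(seq[i] + seq[j] for i, j in pairs)

def score(seq, pairs, w):
    """Integrated score: fixed thermodynamic term plus learned correction."""
    f = features(seq, pairs)
    return sum(n * (THERMO.get(k, 0.0) + w.get(k, 0.0)) for k, n in f.items())

def pair_loss(a, b):
    """Structured loss: number of base pairs present in exactly one structure."""
    return len(a ^ b)

def ssvm_step(seq, gold, candidates, w, lr=0.1, lam=0.01):
    """One subgradient step on a regularized structured hinge loss."""
    # Loss-augmented inference, here a brute-force search over candidates.
    worst = max(candidates, key=lambda y: score(seq, y, w) + pair_loss(gold, y))
    hinge = score(seq, worst, w) + pair_loss(gold, worst) - score(seq, gold, w)
    # L2 shrinkage of the learned weights (assumed form of regularization).
    new_w = {k: v * (1.0 - lr * lam) for k, v in w.items()}
    if hinge > 0:
        fg, fw = features(seq, gold), features(seq, worst)
        for k in set(fg) | set(fw):
            new_w[k] = new_w.get(k, 0.0) + lr * (fg[k] - fw[k])
    return new_w, max(hinge, 0.0)

# Toy data: the placeholder thermodynamic term alone prefers the G-C pair,
# while the (hypothetical) reference structure contains only the U-A pair.
seq = "GUAAAAC"
gold = frozenset({(1, 5)})
candidates = [frozenset(), frozenset({(0, 6)}), gold]

w = {}
for _ in range(50):
    w, hinge = ssvm_step(seq, gold, candidates, w)

pred = max(candidates, key=lambda y: score(seq, y, w))
print(sorted(pred), w)
```

In this sketch the learned weights only correct the fixed thermodynamic scores, so an untrained model (`w = {}`) falls back to the thermodynamic prediction, and the regularizer keeps the corrections small unless the training data require them.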