Improving RNA Secondary Structure Prediction Through Expanded Training Data

Authors

Langeberg Conner J, Kim Taehan, Nagle Roma, Meredith Charlotte, Garuadapuri Dimple Amitha, Doudna Jennifer A, Cate Jamie H D

Affiliations

Innovative Genomics Institute, University of California, Berkeley, CA, USA.

California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA, USA.

Publication information

bioRxiv. 2025 May 3:2025.05.03.652028. doi: 10.1101/2025.05.03.652028.

DOI: 10.1101/2025.05.03.652028

PMID: 40654677

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12247784/

Abstract

In recent years, deep learning has revolutionized protein structure prediction, achieving remarkable speed and accuracy. RNA structure prediction, however, has lagged behind. Although several methods have shown moderate success in predicting RNA secondary and tertiary structures, none has reached the accuracy of contemporary protein models. This shortfall has been attributed to the limited amount of high-quality structural information available as training data. To probe this proposed limitation, we developed RNASSTR, a large and diverse dataset of RNA sequences paired with their corresponding secondary structures. We assessed the utility of this expanded dataset by retraining two deep learning models, SincFold and MXfold2. The retrained SincFold generalized better to some previously unseen RNA families, improving its ability to predict RNA secondary structures de novo. By contrast, retraining MXfold2 on the large RNASSTR dataset proved too computationally expensive, and the model did not achieve high performance on the test set. The RNASSTR dataset provides a substantial advance for RNA structure modeling and lays a strong foundation for the development of future RNA secondary structure prediction algorithms.
