Improving RNA Secondary Structure Prediction Through Expanded Training Data.

Author information

Langeberg Conner J, Kim Taehan, Nagle Roma, Meredith Charlotte, Garuadapuri Dimple Amitha, Doudna Jennifer A, Cate Jamie H D

Affiliations

Innovative Genomics Institute; University of California, Berkeley, CA, USA.

California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA, USA.

Publication information

bioRxiv. 2025 May 3:2025.05.03.652028. doi: 10.1101/2025.05.03.652028.

Abstract

In recent years, deep learning has revolutionized protein structure prediction, achieving remarkable speed and accuracy. RNA structure prediction, however, has lagged behind. Although several methods have shown moderate success in predicting RNA secondary and tertiary structures, none has reached the accuracy of contemporary protein models. This gap has been attributed to the limited amount of high-quality structural information available as training data. To probe this proposed limitation, we developed a large and diverse dataset, RNASSTR, comprising RNA sequences paired with their corresponding secondary structures. We assessed the utility of this dataset by retraining two deep learning models, SincFold and MXfold2. SincFold exhibited improved generalization to some previously unseen RNA families, enhancing its ability to predict RNA secondary structures de novo. By contrast, retraining MXfold2 proved too computationally expensive on the large RNASSTR dataset, and MXfold2 did not achieve high performance on the test set. The RNASSTR dataset provides a substantial advance for RNA structure modeling, laying a strong foundation for the development of future RNA secondary structure prediction algorithms.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6972/12247784/99be46228d4d/nihpp-2025.05.03.652028v1-f0001.jpg
