Badura Jan, Rybarczyk Agnieszka, Zok Tomasz
Institute of Computing Science, Poznan University of Technology, 60-965, Poznan, Poland.
Institute of Bioorganic Chemistry, Polish Academy of Sciences, 61-704, Poznan, Poland.
Sci Rep. 2025 Jul 1;15(1):21417. doi: 10.1038/s41598-025-07041-2.
RNA molecules are essential in regulating biological processes such as gene expression, cellular differentiation, and development. Accurately predicting RNA secondary structures and designing sequences that fold into specific configurations remain significant challenges in computational biology, with far-reaching implications for medicine, synthetic biology, and biotechnology. While machine learning methodologies have been proposed to enhance prediction capabilities, they require high-quality training data. The lack of standardized benchmark datasets further hinders the development and evaluation of these tools. To address this, we created a comprehensive dataset of over 320 thousand instances from experimentally validated sources to establish a new community-wide benchmark for RNA design and modeling algorithms. Our dataset comprises numerous challenging structures for which state-of-the-art RNA inverse folders provide results of varying accuracy. We demonstrated the potential of the dataset by testing it with several popular open-source RNA design algorithms. Furthermore, we illustrated how our dataset can be used to train machine learning models that consider both RNA sequence and structure, potentially advancing RNA design and prediction capabilities.
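A minimal sketch of how an inverse-folding result might be scored against a benchmark instance. The abstract does not say which accuracy measure the authors used, so the base-pair distance below is an assumption (it is a common choice in this field). The function names are hypothetical, and the parser assumes pseudoknot-free dot-bracket notation.

```python
def parse_dot_bracket(structure: str) -> set[tuple[int, int]]:
    """Return the set of paired positions (i, j) in a dot-bracket string.

    Assumption: only '(', ')' and '.' appear (no pseudoknot brackets).
    """
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def base_pair_distance(target: str, predicted: str) -> int:
    """Count base pairs present in exactly one of the two structures."""
    if len(target) != len(predicted):
        raise ValueError("structures must have the same length")
    return len(parse_dot_bracket(target) ^ parse_dot_bracket(predicted))


# Canonical Watson-Crick pairs plus the G-U wobble pair.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_compatible(sequence: str, structure: str) -> bool:
    """Check that every paired position in the structure holds an allowed pair."""
    if len(sequence) != len(structure):
        return False
    return all(
        (sequence[i], sequence[j]) in ALLOWED_PAIRS
        for i, j in parse_dot_bracket(structure)
    )
```

A distance of 0 means the designed sequence folds exactly into the target structure. Larger values indicate a partial match, which is one way the "varying accuracy" of different inverse folders could be quantified.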