RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures.

Affiliation

Université Paris-Saclay, Univ Evry, IBISC, Evry-Courcouronnes 91020, France.

Publication information

Bioinformatics. 2021 Jun 9;37(9):1218-1224. doi: 10.1093/bioinformatics/btaa944.

DOI:10.1093/bioinformatics/btaa944

PMID:33135044

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8189678/

Abstract

MOTIVATION

Applied research in machine learning progresses faster when a clean, ready-to-use dataset is available. Several datasets have been released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, so that time is not spent on data gathering and cleaning.

RESULTS

Here, we propose a first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived by annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided.
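To make two of the descriptors named above more concrete, here is a minimal Python sketch, not the RNANet pipeline itself. It assumes a toy gapped alignment and plain (x, y, z) tuples for atom coordinates. The names `position_frequencies` and `torsion_angle` are hypothetical and chosen only for illustration. The first function returns simple per-column nucleotide frequencies, which is one possible basis for a position-specific matrix; the abstract does not state the exact scoring scheme RNANet uses. The second applies the standard atan2 dihedral formula to four consecutive points.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # assumed RNA alphabet; gaps and other symbols are skipped


def position_frequencies(alignment):
    """Per-column nucleotide frequencies for equal-length aligned sequences."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    matrix = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values())
        # A column made only of gaps gets all-zero frequencies.
        matrix.append(
            {nt: (counts[nt] / total if total else 0.0) for nt in ALPHABET}
        )
    return matrix


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_angle(p1, p2, p3, p4):
    """Signed dihedral angle in degrees defined by four consecutive points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For a backbone torsion such as alpha, the four points would be the coordinates of the four atoms that define that angle in consecutive nucleotides. Reading those coordinates from a structure file is outside the scope of this sketch.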

AVAILABILITY AND IMPLEMENTATION

The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
