Université Paris-Saclay, Univ Evry, IBISC, Evry-Courcouronnes 91020, France.
Bioinformatics. 2021 Jun 9;37(9):1218-1224. doi: 10.1093/bioinformatics/btaa944.
Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, so that time is not spent on data gathering and cleaning.
Here, we propose a first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived from annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure used to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided.
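To illustrate what homology information in the form of a position-specific matrix can look like, here is a minimal Python sketch that computes per-column nucleotide frequencies from a set of aligned RNA sequences. It is not the pipeline described in the paper: the function name `position_frequencies`, the toy alignment and the choice to ignore gaps and ambiguous symbols are all assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGU"  # assumption: only the four standard RNA nucleotides are counted


def position_frequencies(alignment):
    """Return one dict of nucleotide frequencies per alignment column.

    `alignment` is a list of equal-length aligned sequences (hypothetical input;
    the real dataset derives its matrices from Rfam and SILVA alignments).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    matrix = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in alignment)
        # Gaps ('-') and ambiguous symbols are ignored in this sketch.
        total = sum(counts[n] for n in ALPHABET)
        if total == 0:
            matrix.append({n: 0.0 for n in ALPHABET})
        else:
            matrix.append({n: counts[n] / total for n in ALPHABET})
    return matrix


if __name__ == "__main__":
    toy_alignment = ["ACGU", "ACGA", "A-GU"]  # made-up example, not real data
    for i, column in enumerate(position_frequencies(toy_alignment)):
        print(i, column)
```

Each row of the resulting matrix describes how conserved a position is across homologous sequences, which is the kind of per-position feature a learning model can consume alongside the sequence itself.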
The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification.
Supplementary data are available at Bioinformatics online.