MARS and RNAcmap3: The Master Database of All Possible RNA Sequences Integrated with RNAcmap for RNA Homology Search.

Affiliations

Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China.

Peking University Shenzhen Graduate School, Shenzhen 518055, China.

Publication information

Genomics Proteomics Bioinformatics. 2024 May 9;22(1). doi: 10.1093/gpbjnl/qzae018.

Abstract

The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in the huge, integrated database of protein sequences (Big Fantastic Database). In contrast, the existing nucleotide databases have not been consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by incorporating the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome and metagenome assemblies from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide (nt) database and its subsets from the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI's nt database and 60-fold larger than RNAcentral. The new dataset, together with a new split-search strategy, substantially improves homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than the manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS, coupled with the fully automatic homology search tool RNAcmap, will be useful for improved structural and functional inference of ncRNAs and for RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037, and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.
