Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning.

Author information

Akiyama Manato, Sakakibara Yasubumi

Affiliation

Department of Biosciences and Informatics, Keio University, 223-8522, Japan.

Publication information

NAR Genom Bioinform. 2022 Feb 22;4(1):lqac012. doi: 10.1093/nargab/lqac012. eCollection 2022 Mar.

DOI:10.1093/nargab/lqac012

PMID:35211670

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8862729/

Abstract

Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this 'informative base embedding' and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman-Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of $O(n^2)$ instead of the $O(n^6)$ time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length $n$.
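
The combination of per-base embeddings with Needleman-Wunsch alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `align_embeddings`), the use of cosine similarity as the match score, and the linear gap penalty of -0.5 are all assumptions made for illustration, and the toy two-dimensional vectors stand in for embeddings that a pre-trained model would produce. The sketch only shows why the dynamic program is quadratic in sequence length: one table cell per pair of positions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_embeddings(emb_x, emb_y, gap=-0.5):
    """Needleman-Wunsch global alignment over per-base embedding vectors.

    Assumption: the match score is the cosine similarity of the two
    embeddings, and gaps carry a fixed linear penalty. Returns
    (score, pairs), where pairs lists (i, j) index tuples and None
    marks a gap in that sequence.
    """
    n, m = len(emb_x), len(emb_y)
    # score[i][j]: best score aligning the first i bases of x with the first j of y
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: (n + 1) * (m + 1) cells, constant work per cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + cosine(emb_x[i - 1], emb_y[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right corner, recomputing the same expressions
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + cosine(emb_x[i - 1], emb_y[j - 1])):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Toy embeddings (hypothetical values, not model output)
emb_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_y = [[1.0, 0.0], [1.0, 1.0]]
score, pairs = align_embeddings(emb_x, emb_y)
print(score, pairs)
```

In this toy case the middle base of the first sequence is aligned to a gap, because its vector is dissimilar to both bases of the second sequence. With context-sensitive embeddings, the same recurrence would reward positions that share structural context rather than only identical nucleotides, which is the idea the abstract describes.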

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/14f2/8862729/5e094a4ce59f/lqac012fig1.jpg
