RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models.

Authors

Asim Muhammad Nabeel, Ibrahim Muhammad Ali, Asif Tayyaba, Dengel Andreas

Affiliations

German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany.

Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany.

Publication information

Heliyon. 2025 Jan 6;11(2):e41488. doi: 10.1016/j.heliyon.2024.e41488. eCollection 2025 Jan 30.

Abstract

Deciphering the information in RNA sequences reveals their diverse roles in living organisms, including gene regulation and protein synthesis. Aberrations in RNA sequences, such as dysregulation and mutations, can drive a diverse spectrum of diseases, including cancers, genetic disorders, and neurodegenerative conditions. Furthermore, researchers are harnessing RNA's therapeutic potential to transform traditional treatment paradigms into personalized therapies through the development of RNA-based drugs and gene therapies. To gain insights into biological functions, detect diseases at early stages, and develop potent therapeutics, researchers perform diverse types of RNA sequence analysis tasks. RNA sequence analysis through conventional wet-lab methods is expensive, time-consuming, and error-prone. Enabling large-scale RNA sequence analysis by empowering wet-lab experimental methods with Artificial Intelligence (AI) applications requires scientists to have comprehensive knowledge of both the DNA and AI fields. While molecular biologists encounter challenges in understanding AI methods, computer scientists often lack the basic foundations of RNA sequence analysis tasks. Given the absence of comprehensive literature that bridges this research gap and promotes the development of AI-driven RNA sequence analysis applications, the contributions of this manuscript are manifold: It equips AI researchers with the biological foundations of 47 distinct RNA sequence analysis tasks. It sets the stage for the development of benchmark datasets for these 47 tasks by summarizing the key details of 64 different biological databases. It presents applications of word embeddings and language models across the 47 tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 58 word embeddings and 70 language models, as well as the top-performing traditional sequence-encoding-based predictors and their performance, across the 47 RNA sequence analysis tasks.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c8/11783440/33e63ecbb2d6/gr001.jpg
