Deciphering genomic codes using advanced NLP techniques: a scoping review.

Authors

Cheng Shuyan, Wei Yishu, Zhou Yiliang, Xu Zihan, Wright Drew N, Liu Jinze, Peng Yifan

Affiliations

Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065.

Samuel J. Wood Library & C.V. Starr Biomedical Information Center, Weill Cornell Medicine, New York, NY 10065.

Publication information

ArXiv. 2024 Nov 25:arXiv:2411.16084v1.

Abstract

OBJECTIVES

The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review investigates the application of Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, to deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. It also assesses data and model accessibility in the most recent literature to better understand the existing capabilities and constraints of these tools in processing genomic sequencing data.

METHODS

Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.

RESULTS

A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.
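
To make the tokenization step concrete, the following is a minimal sketch of how a DNA sequence might be turned into integer token IDs before it reaches a transformer model. The abstract does not say which tokenization schemes the reviewed studies use; overlapping k-mer tokenization is assumed here only because it is a common choice in genomic language models. The function names (`kmer_tokenize`, `build_vocab`, `encode`), the `[UNK]` token, and the example sequence are all hypothetical and do not come from the review.

```python
from typing import Dict, List


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers, advancing `stride` bases per token.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    (Assumed scheme for illustration, not taken from the review.)
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign an integer ID to each distinct token; ID 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to IDs, falling back to the [UNK] ID for unseen tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


if __name__ == "__main__":
    tokens = kmer_tokenize("ATGCGTAC", k=3)
    vocab = build_vocab(tokens)
    print(tokens)
    print(encode(tokens, vocab))
```

The resulting ID sequence is what a transformer's embedding layer would consume. Smaller strides preserve more positional detail at the cost of longer inputs, which is one of the trade-offs a tokenization choice has to balance.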

DISCUSSION

The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while providing a better understanding of its complex structures. By offering more efficient and scalable solutions for genomic analysis, these methods could drive advancements in personalized medicine. Further research is needed to characterize and overcome current limitations and to improve model transparency and applicability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd8e/11623714/9a0c773e6b17/nihpp-2411.16084v1-f0001.jpg
