DNA promoter task-oriented dictionary mining and prediction model based on natural language technology.

Authors

Zeng Ruolei, Li Zihan, Li Jialu, Zhang Qingchuan

Affiliations

Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA.

National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.

Publication Information

Sci Rep. 2025 Jan 2;15(1):153. doi: 10.1038/s41598-024-84105-9.

Abstract

Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT .
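The abstract does not specify the segmentation algorithm behind the refined DNA dictionary, only that it replaces arbitrary fixed splits before BERT pre-training. As an illustration of the general idea — mining frequent subsequences from a DNA corpus to form a task-oriented vocabulary — the sketch below applies a minimal BPE-style merge procedure. The function name, parameters, and merge count are hypothetical, not taken from the paper.

```python
from collections import Counter

def build_dna_dictionary(sequences, num_merges=2):
    """Greedy BPE-style dictionary mining: repeatedly fuse the most
    frequent adjacent token pair into a new vocabulary entry."""
    # Start from single-nucleotide tokens.
    corpus = [list(seq) for seq in sequences]
    vocab = {"A", "C", "G", "T"}
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Re-segment each sequence with the new merge applied greedily.
        new_corpus = []
        for toks in corpus:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return vocab, corpus

# Toy corpus: the frequent motif "TATA" is mined as a dictionary entry.
vocab, tokenized = build_dna_dictionary(["TATAAT", "TATACG", "GGTATA"])
```

On this toy corpus the first merge learns "TA" and the second learns "TATA", so "TATAAT" is segmented as ["TATA", "A", "T"] — frequent motifs become single tokens rather than being cut at arbitrary fixed boundaries, which is the intuition the abstract describes.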

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a49c/11697570/f57a69ce6bad/41598_2024_84105_Fig1_HTML.jpg
