DNA promoter task-oriented dictionary mining and prediction model based on natural language technology.

Authors

Zeng Ruolei, Li Zihan, Li Jialu, Zhang Qingchuan

Affiliations

Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA.

National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.

Publication Information

Sci Rep. 2025 Jan 2;15(1):153. doi: 10.1038/s41598-024-84105-9.

Abstract

Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT .
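The abstract does not specify the segmentation algorithm behind the refined DNA dictionary, only that it replaces arbitrary fixed splits before BERT pre-training. As an illustration of the general idea — mining frequent subsequences from a DNA corpus to form a task-oriented vocabulary — the sketch below applies a minimal BPE-style merge procedure. The function name, parameters, and merge count are hypothetical, not taken from the paper.

```python
from collections import Counter

def build_dna_dictionary(sequences, num_merges=2):
    """Greedy BPE-style dictionary mining: repeatedly fuse the most
    frequent adjacent token pair into a new vocabulary entry."""
    # Start from single-nucleotide tokens.
    corpus = [list(seq) for seq in sequences]
    vocab = {"A", "C", "G", "T"}
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Re-segment each sequence with the new merge applied greedily.
        new_corpus = []
        for toks in corpus:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return vocab, corpus

# Toy corpus: the frequent motif "TATA" is mined as a dictionary entry.
vocab, tokenized = build_dna_dictionary(["TATAAT", "TATACG", "GGTATA"])
```

On this toy corpus the first merge learns "TA" and the second learns "TATA", so "TATAAT" is segmented as ["TATA", "A", "T"] — frequent motifs become single tokens rather than being cut at arbitrary fixed boundaries, which is the intuition the abstract describes.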

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a49c/11697570/f57a69ce6bad/41598_2024_84105_Fig1_HTML.jpg
