MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites.

Affiliation

The School of Science, Dalian Maritime University, Dalian 116026, China.

Publication information

J Chem Inf Model. 2024 May 27;64(10):4322-4333. doi: 10.1021/acs.jcim.3c02088. Epub 2024 May 11.

DOI:10.1021/acs.jcim.3c02088

Abstract

Revealing the mechanisms that influence transcription factor binding specificity is key to understanding gene regulation. Previous studies have successfully used DNA double-helix structure and one-hot embedding to design computational methods for predicting transcription factor binding sites (TFBSs). However, although DNA sequence can be regarded as a kind of biological language, the word embedding representations used in natural language processing have not been properly considered in TFBS prediction models. In this work, we integrate different types of DNA sequence features into a multichannel deep learning framework, MulTFBS, in which three inputs are trained in separate channels: one-hot encoding; word embedding encoding, which can incorporate contextual information and extract global features of the sequences; and three-dimensional structural features of the double helix. To extract high-level sequence information effectively, the framework uses a spatial-temporal network that combines convolutional neural networks and bidirectional long short-term memory networks with an attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with an average R² of 0.698 and an average Pearson correlation coefficient (PCC) of 0.833, which are 5.4% and 3.2% higher, respectively, than the second-best method, CRPTS. In addition, we evaluate the classification performance of MulTFBS in distinguishing bound from unbound regions on TF ChIP-seq data. The results show that the framework also performs well in TFBS classification tasks.
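
As a rough illustration of two of the input representations named in the abstract, the sketch below builds a one-hot matrix and a list of overlapping k-mer "words" (the usual input to a word embedding layer) from a DNA string, and computes the Pearson correlation coefficient (PCC) used to score regression. It is not the authors' implementation: the function names, the choice of k = 3, the stride of 1, and the all-zero encoding of unknown bases are assumptions made for this example, since the abstract does not specify them.

```python
import math

BASES = "ACGT"  # column order of the one-hot matrix (an assumption)


def one_hot(seq):
    """Encode a DNA string as one 4-element row per base (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row; this
    convention is an assumption, not something stated in the abstract.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def kmer_tokens(seq, k=3):
    """Split a DNA string into overlapping k-mers with stride 1.

    Each k-mer plays the role of a "word" that a word embedding layer
    would map to a dense vector. k=3 is a hypothetical default.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    print(one_hot("ACGN"))         # four rows; the last row is all zeros
    print(kmer_tokens("ACGTAC"))   # ['ACG', 'CGT', 'GTA', 'TAC']
    print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In a multichannel model of the kind described, each representation would feed its own channel. The structural-feature channel and the CNN/BiLSTM/attention layers are omitted here because the abstract gives no details from which to sketch them.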
