Rehman Mobeen Ur, Abbas Zeeshan, Ullah Farman, Hussain Irfan
Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
Department of Precision Medicine, Sungkyunkwan University School of Medicine, Suwon, Republic of Korea.
Comput Struct Biotechnol J. 2025 Jul 23;27:3275-3284. doi: 10.1016/j.csbj.2025.07.008. eCollection 2025.
Accurate identification of enhancer regions in DNA sequences is essential for understanding gene regulation and its role in diverse biological processes. Enhancers are regulatory elements that influence gene expression, but detecting them remains challenging because genomic sequences are complex and highly variable. In this study, we propose AttnW2V-Enhancer, a novel model that combines Word2Vec-based sequence encoding, convolutional neural networks (CNNs), and attention mechanisms to address this challenge. By using Word2Vec embeddings, the model captures biologically meaningful patterns and offers a more efficient and interpretable representation than traditional methods such as one-hot encoding and physicochemical descriptors. On an independent test set, AttnW2V-Enhancer achieves an accuracy of 81.75%, a sensitivity of 83.50%, a specificity of 80.00%, and a Matthews correlation coefficient (MCC) of 0.635, outperforming existing models. We also show that the attention mechanism improves feature learning by dynamically focusing on the most relevant sequence regions. These results indicate that integrating Word2Vec encoding with CNNs and attention mechanisms provides a powerful and interpretable framework for enhancer prediction and for the identification of regulatory sequences. The source code and implementation are publicly available at: https://github.com/Rehman1995/AttnW2V-Enhancer.
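To make the Word2Vec-based encoding step more concrete, the sketch below shows one common way to turn a DNA sequence into "words" that an embedding model can consume. The abstract does not state how sequences are tokenized, so the overlapping k-mer scheme, the value k = 3, the helper names `kmer_tokens` and `build_vocab`, and the toy sequences are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def kmer_tokens(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mers (hypothetical tokenizer)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(token_lists):
    """Map each distinct k-mer to an integer index, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common())}


# Toy sequences for illustration only; they are not taken from the paper's data.
sequences = ["ACGTAC", "GTACGT"]

# Each sequence becomes a "sentence" of k-mer "words".
corpus = [kmer_tokens(s) for s in sequences]

# Integer-encode the sentences so they can index rows of an embedding matrix.
vocab = build_vocab(corpus)
encoded = [[vocab[tok] for tok in toks] for toks in corpus]
```

Token lists such as `corpus` are the kind of input a Word2Vec trainer expects. The learned vector for each k-mer would then replace the integer index before the sequence is passed to the CNN and attention layers.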