School of Mathematics and Statistics, Hainan University, Haikou, 570228, China.
School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
BMC Biol. 2024 May 30;22(1):126. doi: 10.1186/s12915-024-01923-z.
A promoter is a specific DNA sequence with transcriptional regulatory functions that plays a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means of identifying promoters, offering a more efficient alternative to labor-intensive biological approaches.
In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, addressing the insufficient extraction of DNA sequence information in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used with the DNABERT model for promoter identification and strength prediction. Our model achieves accuracies of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, which enhances its interpretability.
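The multi-scale tokenization and soft-voting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of overlapping k-mers as the tokenization, the choice of k values, and the class-index conventions (1 = promoter, 1 = strong) are all assumptions made for illustration, and the probability vectors stand in for the outputs of fine-tuned models.

```python
from typing import List, Sequence


def kmer_tokenize(seq: str, k: int) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1 (assumed scheme)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def soft_vote(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors produced by several models."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


def predict_label(prob_lists: Sequence[Sequence[float]]) -> int:
    """Return the class index with the highest averaged probability."""
    avg = soft_vote(prob_lists)
    return max(range(len(avg)), key=avg.__getitem__)


def two_stage_predict(stage1_probs: Sequence[Sequence[float]],
                      stage2_probs: Sequence[Sequence[float]]) -> str:
    """Stage 1 decides promoter vs. non-promoter; stage 2 decides strong vs. weak.

    Assumed conventions: index 1 means "promoter" in stage 1 and "strong" in stage 2.
    """
    if predict_label(stage1_probs) == 0:
        return "non-promoter"
    return "strong" if predict_label(stage2_probs) == 1 else "weak"


if __name__ == "__main__":
    sequence = "ACGTAC"
    # One token view per scale; the k values here are illustrative only.
    for k in (3, 4):
        print(k, kmer_tokenize(sequence, k))

    # Hypothetical per-scale probabilities standing in for model outputs.
    stage1 = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
    stage2 = [[0.7, 0.3], [0.6, 0.4], [0.4, 0.6]]
    print(two_stage_predict(stage1, stage2))
```

Averaging probabilities rather than counting hard votes lets a confident model at one scale outweigh two uncertain models at other scales, which is the usual motivation for soft voting.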
msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.