miProBERT: identification of microRNA promoters based on the pre-trained model BERT.

Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

Publication information

Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad093.

Abstract

Accurate prediction of the promoter regions that drive miRNA gene expression remains a major challenge because annotation information for pri-miRNA transcripts is lacking. This gap hinders our understanding of miRNA-mediated regulatory networks. Several algorithms have been designed over the past decade to detect miRNA promoters. However, these methods rely on biosignal data such as CpG islands and still need to be improved. Here, we propose miProBERT, a BERT-based model that predicts promoters directly from gene sequences without using any structural or biological signals. To our knowledge, this is the first time a BERT-based model has been employed to identify miRNA promoters. We start from the pre-trained model DNABERT, fine-tune it on a gene promoter dataset so that its representation captures the richer biological properties of promoter sequences, and then systematically scan the upstream regions of each intergenic miRNA with the fine-tuned model. About 665 miRNA promoters are found. The innovative use of a random substitution strategy to construct a negative dataset improves the discriminative ability of the model and further reduces the false positive rate (FPR) to as low as 0.0421. On independent datasets, miProBERT outperformed other gene promoter prediction methods. In a comparison on 33 experimentally validated miRNA promoter datasets, miProBERT significantly outperformed previously developed miRNA promoter prediction programs, with 78.13% precision and 75.76% recall. We further verify the predicted promoter regions by analyzing conservation, CpG content and histone marks. These results highlight the effectiveness and robustness of miProBERT.
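The pipeline described above (k-mer input for a DNABERT-style model, negatives built by random substitution, and a sliding-window scan of the upstream region) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the window size, step, threshold, number of segments or fraction of segments replaced, so every default below is an illustrative assumption, and `gc_fraction` is only a hypothetical placeholder for the fine-tuned classifier's score.

```python
import random
from typing import Callable, List, Optional, Tuple

NUCLEOTIDES = "ACGT"


def kmer_tokens(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def random_substitution_negative(
    seq: str,
    n_parts: int = 20,       # assumed value, not from the abstract
    n_replaced: int = 12,    # assumed value, not from the abstract
    rng: Optional[random.Random] = None,
) -> str:
    """Build a negative example from a promoter sequence.

    The sequence is cut into n_parts segments; n_replaced randomly chosen
    segments are overwritten with random nucleotides, the rest are kept.
    """
    if len(seq) < n_parts or n_replaced > n_parts:
        raise ValueError("sequence too short or too many segments requested")
    rng = rng or random.Random()
    size = len(seq) // n_parts
    chosen = set(rng.sample(range(n_parts), n_replaced))
    pieces = []
    for idx in range(n_parts):
        start = idx * size
        # The last segment absorbs any remainder.
        end = (idx + 1) * size if idx < n_parts - 1 else len(seq)
        segment = seq[start:end]
        if idx in chosen:
            segment = "".join(rng.choice(NUCLEOTIDES) for _ in segment)
        pieces.append(segment)
    return "".join(pieces)


def scan_upstream(
    region: str,
    score: Callable[[str], float],
    window: int = 300,       # assumed value
    step: int = 50,          # assumed value
    threshold: float = 0.5,  # assumed value
) -> List[Tuple[int, float]]:
    """Slide a window over an upstream region and keep high-scoring offsets."""
    hits = []
    for start in range(0, len(region) - window + 1, step):
        s = score(region[start:start + window])
        if s >= threshold:
            hits.append((start, s))
    return hits


def gc_fraction(window: str) -> float:
    """Hypothetical stand-in scorer; a real run would call the fine-tuned model."""
    return (window.count("G") + window.count("C")) / len(window)
```

In practice the `score` argument would wrap the fine-tuned classifier applied to `kmer_tokens(window)`. Keeping the scorer pluggable lets the scanning logic be tested without loading a model.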
