PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence.

Affiliations

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China.

School of Artificial Intelligence, Jilin University, Changchun 130012, China.

Publication information

Genes (Basel). 2023 Jul 13;14(7):1441. doi: 10.3390/genes14071441.

Abstract

Promoters are non-coding DNA regions around the transcription start site that regulate gene transcription. Because of their key role in gene function and transcriptional activity, accurately predicting promoter sequences and their core elements is a crucial research area in bioinformatics. Machine learning and deep learning models have been developed for promoter prediction. However, these models neither mine the deeper biological information in promoter sequences nor consider the complex relationships among them. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. First, PromGER uses four types of feature-encoding methods to extract local information within each promoter sequence. Second, based on the potential relationships among promoter sequences, the full set of sequences is constructed as a graph. Third, graph-embedding methods at three different scales are applied to capture the global feature information in the graph more comprehensively. Finally, combining the local and global features of the sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved average specificity by 13%, accuracy by 10%, Matthews correlation coefficient by 16%, precision by 4%, F1 score by 6%, and AUC by 9%. In addition, this study interpreted PromGER using the t-distributed stochastic neighbor embedding (t-SNE) method and SHapley Additive exPlanations (SHAP) value analysis, which demonstrates the interpretability of the model.
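
The local-plus-global feature idea described in the abstract can be illustrated with a small sketch. It is not PromGER's implementation: the abstract does not name the four encodings, the graph-construction rule, or the three embedding methods, so everything here is an assumption chosen for illustration. Local features are 2-mer frequencies, the graph links each sequence to its most cosine-similar neighbor, and the "global" feature is a crude stand-in for a graph embedding (node degree plus the mean neighbor vector). The toy sequences and all function names are hypothetical.

```python
from itertools import product
from math import sqrt

def kmer_frequencies(seq, k=2):
    """Local feature (assumed encoding): normalized counts of all 4**k DNA k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other symbols
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(features, n_neighbors=1):
    """Assumed graph rule: undirected edges to each node's most similar sequences."""
    n = len(features)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(features[i], features[j]),
            reverse=True,
        )
        for j in ranked[:n_neighbors]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

def global_features(adjacency, features):
    """Stand-in for a graph embedding: node degree plus mean neighbor vector."""
    dim = len(features[0])
    rows = []
    for i in range(len(features)):
        nbrs = sorted(adjacency[i])
        mean = [
            sum(features[j][d] for j in nbrs) / len(nbrs) if nbrs else 0.0
            for d in range(dim)
        ]
        rows.append([float(len(nbrs))] + mean)
    return rows

# Hypothetical toy sequences: two AT-rich, two GC-rich.
sequences = [
    "TATAAAGGCGCGTATA",
    "TATATAAGGCGCTATA",
    "GCGCGCGCGGCCGCGC",
    "GGCCGCGCGCGGGCGC",
]
local = [kmer_frequencies(s) for s in sequences]
graph = knn_graph(local, n_neighbors=1)
# Concatenate local and global features, one row per sequence.
combined = [loc + glob for loc, glob in zip(local, global_features(graph, local))]
```

Each row of `combined` is the kind of vector that could then be passed to a tree-based ensemble classifier. That step is omitted here because the abstract does not specify the framework.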


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7254/10379012/4791377bd9ff/genes-14-01441-g001.jpg
