Pol II promoter prediction using characteristic 4-mer motifs: a machine learning approach.

Authors

Anwar Firoz, Baker Syed Murtuza, Jabid Taskeed, Mehedi Hasan Md, Shoyaib Mohammad, Khan Haseena, Walshe Ray

Affiliation

Department of Computer Science and Engineering, East West University, Bangladesh.

Publication information

BMC Bioinformatics. 2008 Oct 4;9:414. doi: 10.1186/1471-2105-9-414.

DOI:10.1186/1471-2105-9-414

PMID:18834544

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2575220/

Abstract

BACKGROUND

Predicting eukaryotic promoters computationally is one of the most difficult tasks in computational genomics, and it is essential for constructing and understanding genetic regulatory networks. The increased availability of sequence data for various eukaryotic organisms in recent years has created a need for better tools and techniques for predicting and analysing promoters in eukaryotic sequences. Many promoter prediction methods and tools have been developed to date, but they have yet to provide acceptable predictive performance. One obvious way to improve on current methods is to devise a better system for selecting features that distinguish promoters from non-promoters. A second is to enhance the predictive ability of the machine learning algorithms used.

RESULTS

In this paper, a novel approach is presented in which 128 4-mer motifs are used together with a non-linear machine-learning algorithm, a Support Vector Machine (SVM), to distinguish between promoter and non-promoter DNA sequences. Applied to plant, Drosophila, human, mouse and rat sequences, the classification model showed 7-fold cross-validation accuracies of 83.81%, 94.82%, 91.25%, 90.77% and 82.35%, respectively. The high sensitivity and specificity values of 0.86 and 0.90 for plant; 0.96 and 0.92 for Drosophila; 0.88 and 0.92 for human; 0.78 and 0.84 for mouse; and 0.82 and 0.80 for rat demonstrate that this technique is less prone to false positive results and exhibits better performance than many other tools. Moreover, the model successfully identifies the location of promoters using a TATA weight matrix.
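To make the feature representation concrete, the following is a minimal sketch of how 4-mer frequencies could be computed from a DNA sequence. It is an illustration only, not the authors' implementation: the function name `kmer_frequencies`, the sliding-window counting scheme and the example sequence are assumptions, and the sketch uses all 256 possible 4-mers because the abstract does not say how the 128 characteristic motifs were chosen.

```python
from itertools import product

K = 4
# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
# Assumption: the abstract does not list the 128 motifs actually used,
# so the full set serves as a stand-in here.
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_frequencies(seq, kmers=ALL_KMERS, k=K):
    """Return the relative frequency of each k-mer in `kmers` within `seq`.

    Overlapping windows are counted; windows containing characters other
    than A, C, G or T (for example N) are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(kmers, 0)
    valid_windows = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            valid_windows += 1
            if word in counts:
                counts[word] += 1
    if valid_windows == 0:
        return [0.0] * len(kmers)
    return [counts[m] / valid_windows for m in kmers]


# Hypothetical toy sequence, not taken from the paper's data sets.
features = kmer_frequencies("TATAAAAGGCGC")
```

A vector like `features` could then be passed, one row per sequence, to any SVM implementation with a non-linear kernel. Restricting `kmers` to a selected subset would give a shorter vector of the kind the abstract describes.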

CONCLUSION

The high sensitivity and specificity indicate that 4-mer frequencies, used in conjunction with supervised machine-learning methods, can be beneficial for identifying RNA pol II promoters compared with other methods. This approach can be extended to identify promoters in the sequences of other eukaryotic genomes.
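For readers less familiar with the metrics reported above, the sketch below shows the standard definitions of sensitivity, specificity and accuracy for a binary promoter versus non-promoter classifier. The labels are invented toy values, not results from the paper, and the function name `evaluate` is an assumption made for illustration.

```python
def evaluate(y_true, y_pred):
    """Compute sensitivity, specificity and accuracy for binary labels.

    Labels are 1 for promoter and 0 for non-promoter.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        # Fraction of true promoters that were predicted as promoters.
        "sensitivity": tp / (tp + fn),
        # Fraction of true non-promoters that were predicted as non-promoters.
        "specificity": tn / (tn + fp),
        # Fraction of all sequences that were classified correctly.
        "accuracy": (tp + tn) / len(y_true),
    }


# Hypothetical toy labels: 4 promoters and 6 non-promoters.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = evaluate(y_true, y_pred)
```

High specificity corresponds to few false positives, which is the property the abstract highlights when it says the technique is less prone to false positive results.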

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0b7/2575220/abbad118127b/1471-2105-9-414-1.jpg
