Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors.

Authors

Abbas Mostafa M, Mohie-Eldin Mostafa M, El-Manzalawy Yasser

Affiliations

KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar.

Department of Mathematics, Faculty of Science, Al-Azhar University, Cairo, Egypt.

Publication information

PLoS One. 2015 Mar 24;10(3):e0119721. doi: 10.1371/journal.pone.0119721. eCollection 2015.

DOI: 10.1371/journal.pone.0119721
PMID: 25803493
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4372424/

Abstract

As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are key regulatory elements that recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; and iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing very poorly on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that cross-validation might estimate. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best-performing sequence-based classifiers outperform the best-performing structure-based classifiers in both cross-validation and independent test evaluation experiments. Finally, we propose a meta-predictor method combining two top-performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
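
The abstract does not say how the sequence-based and structure-based classifiers are combined, so the following is only a minimal sketch under stated assumptions. It shows one common sequence-based representation (normalized k-mer frequencies) and a meta-predictor that averages the scores of two base predictors, then checks it on a small held-out set. The function names, the two toy scorers, the averaging rule, the 0.5 threshold, and the example sequences are all hypothetical and are not taken from the paper.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def kmer_frequencies(seq: str, k: int = 2) -> Dict[str, float]:
    """Return normalized k-mer frequencies over the alphabet ACGT."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def toy_sequence_score(seq: str) -> float:
    """Hypothetical sequence-based scorer: fraction of A/T bases."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq) if seq else 0.0


def toy_structure_score(seq: str) -> float:
    """Hypothetical stand-in for a structure-based scorer.

    Uses the share of dinucleotide steps built only from A and T.
    This is a placeholder, not a real physical property of DNA.
    """
    freqs = kmer_frequencies(seq, k=2)
    return sum(freqs[step] for step in ("AA", "AT", "TA", "TT"))


def meta_predict(
    seq: str,
    scorers: List[Callable[[str], float]],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> Tuple[float, bool]:
    """Average the base predictors' scores and threshold the mean."""
    scores = [scorer(seq) for scorer in scorers]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold


def accuracy(
    examples: List[Tuple[str, bool]],
    scorers: List[Callable[[str], float]],
) -> float:
    """Fraction of (sequence, is_promoter) pairs labeled correctly."""
    hits = sum(meta_predict(seq, scorers)[1] == label for seq, label in examples)
    return hits / len(examples)


if __name__ == "__main__":
    scorers = [toy_sequence_score, toy_structure_score]
    # Made-up held-out examples, kept separate from any data used for tuning.
    independent_test = [("TATAATGC", True), ("GCGCGGCC", False)]
    print(meta_predict("TATAAT", scorers))
    print(accuracy(independent_test, scorers))
```

In practice the two base predictors would be trained classifiers, and `accuracy` would be computed on sequences that played no part in training or cross-validation, which is the independent-test evaluation the abstract argues for.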


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fd0/4372424/2329bced575f/pone.0119721.g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fd0/4372424/b07d001291a9/pone.0119721.g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fd0/4372424/edf9627afad4/pone.0119721.g003.jpg
