Abbas Mostafa M, Mohie-Eldin Mostafa M, El-Manzalawy Yasser
KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar.
Department of Mathematics, Faculty of Science, Al-Azhar University, Cairo, Egypt.
PLoS One. 2015 Mar 24;10(3):e0119721. doi: 10.1371/journal.pone.0119721. eCollection 2015.
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing very poorly on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments.
Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
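To make the notions of a sequence-based representation and a meta-predictor concrete, the following is a minimal Python sketch. It is not the authors' method: the abstract does not say which features were used or how the two classifiers are combined. The k-mer counting, the score averaging, the 0.5 threshold, and the names `kmer_counts`, `meta_predict`, and `Scorer` are all assumptions made for illustration only.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical type: any trained classifier wrapped as "sequence -> score in [0, 1]".
Scorer = Callable[[str], float]


def kmer_counts(seq: str, k: int = 2) -> Dict[str, int]:
    """Count overlapping k-mers over A/C/G/T.

    One possible sequence-based representation (an assumption, not
    necessarily the representation used in the paper).
    """
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
    return counts


def meta_predict(seq: str, scorers: List[Scorer], threshold: float = 0.5) -> bool:
    """Average the scores of several classifiers and apply a threshold.

    Simple averaging is an assumed combination rule; the abstract does not
    specify how the sequence-based and structure-based models are combined.
    """
    mean_score = sum(scorer(seq) for scorer in scorers) / len(scorers)
    return mean_score >= threshold


if __name__ == "__main__":
    # Constant placeholder scorers standing in for trained models.
    sequence_model: Scorer = lambda s: 0.9
    structure_model: Scorer = lambda s: 0.3
    print(kmer_counts("TATAAT", k=2)["TA"])
    print(meta_predict("TATAAT", [sequence_model, structure_model]))
```

In line with the abstract's main point, any such combined model would need to be scored on sequences held out from all training and cross-validation folds, since cross-validation estimates alone can be over-optimistic.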