Centro de Ciencias Aplicadas y Desarrollo Tecnológico, Universidad Nacional Autónoma de México, México, D.F., México.
Nucleic Acids Res. 2010 Jul;38(12):e130. doi: 10.1093/nar/gkq254. Epub 2010 Apr 12.
We present a simple and highly accurate computational method for operon prediction, based on intergenic distances and functional relationships between the protein products of contiguous genes, as defined by the STRING database (Jensen,L.J., Kuhn,M., Stark,M., Chaffron,S., Creevey,C., Muller,J., Doerks,T., Julien,P., Roth,A., Simonovic,M. et al. (2009) STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37, D412-D416). These two parameters were used to train a neural network on a subset of experimentally characterized Escherichia coli and Bacillus subtilis operons. Our predictive model was successfully tested on the set of experimentally defined operons in E. coli and B. subtilis, with accuracies of 94.6 and 93.3%, respectively. As far as we know, these are the highest accuracies ever obtained for predicting bacterial operons. Furthermore, to evaluate the predictive accuracy of our model when one organism's data set is used for training and a different organism's data set for testing, we repeated the E. coli operon prediction analysis using a neural network trained with B. subtilis data, and the B. subtilis analysis using a neural network trained with E. coli data. Even in these cases, the accuracies reached with our method remained high: 91.5 and 93%, respectively. These results show the potential of our method for accurately predicting the operons of any other organism. Our operon predictions for fully sequenced genomes are available at http://operons.ibt.unam.mx/OperonPredictor/.
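The two-feature idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the network architecture, so a single logistic unit trained by gradient descent stands in for the neural network. All gene names, coordinates, scores, and labels below are invented for illustration. Two further assumptions: intergenic distance is taken as next start minus previous end minus one, and the STRING-style score is taken to lie in [0, 1].

```python
import math

# Hypothetical gene records: (name, start, end, strand). All values are invented.
GENES = [
    ("geneA", 100, 1000, "+"),
    ("geneB", 1010, 1900, "+"),
    ("geneC", 1896, 2500, "+"),  # overlaps geneB by a few bases
    ("geneD", 2900, 3600, "+"),
    ("geneE", 3700, 4400, "-"),
]

# Hypothetical STRING-style association scores in [0, 1] for adjacent gene pairs.
SCORES = {
    ("geneA", "geneB"): 0.95,
    ("geneB", "geneC"): 0.90,
    ("geneC", "geneD"): 0.10,
}

# Hypothetical training set: (intergenic_distance_bp, score, label),
# where label 1 = same operon and 0 = operon boundary.
TRAINING = [
    (-4, 0.95, 1), (10, 0.90, 1), (25, 0.80, 1), (60, 0.85, 1),
    (150, 0.20, 0), (300, 0.10, 0), (400, 0.05, 0), (120, 0.30, 0),
]


def contiguous_pairs(genes, scores):
    """Return (name_a, name_b, distance, score) for adjacent same-strand genes."""
    ordered = sorted(genes, key=lambda g: g[1])
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        if a[3] != b[3]:
            continue  # genes on different strands cannot share an operon
        distance = b[1] - a[2] - 1  # negative when the two genes overlap
        pairs.append((a[0], b[0], distance, scores.get((a[0], b[0]), 0.0)))
    return pairs


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, distance, score):
    """Probability that two adjacent genes belong to the same operon."""
    # Distance is rescaled so both features have a comparable range.
    return sigmoid(w[0] * (distance / 100.0) + w[1] * score + b)


def train(examples, epochs=2000, lr=0.5):
    """Fit a single logistic unit by batch gradient descent on log loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(examples)
    for _ in range(epochs):
        gw0 = gw1 = gb = 0.0
        for distance, score, label in examples:
            err = predict(w, b, distance, score) - label
            gw0 += err * (distance / 100.0)
            gw1 += err * score
            gb += err
        w[0] -= lr * gw0 / n
        w[1] -= lr * gw1 / n
        b -= lr * gb / n
    return w, b


def accuracy(w, b, examples):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((predict(w, b, d, s) >= 0.5) == bool(y) for d, s, y in examples)
    return hits / len(examples)


if __name__ == "__main__":
    w, b = train(TRAINING)
    for name_a, name_b, distance, score in contiguous_pairs(GENES, SCORES):
        print(name_a, name_b, distance, score, round(predict(w, b, distance, score), 3))
```

A cross-organism check of the kind described above would call `train` on one organism's labeled gene pairs and `accuracy` on another's. The toy data here is far too small to say anything about real performance.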