Chen Xin, Su Zhengchang, Xu Ying, Jiang Tao
Department of Computer Science and Engineering, University of California at Riverside, CA 92507, USA.
Genome Inform. 2004;15(2):211-22.
We computationally predict operons in the Synechococcus sp. WH8102 genome based on three types of genomic data: intergenic distances, COG gene functions and phylogenetic profiles. In the proposed method, we first estimate a log-likelihood distribution for each type of genomic data, and then fuse this distributional information with a perceptron to discriminate pairs of genes within operons (WO pairs) from those across transcription unit borders (TUB pairs). Computational experiments demonstrate that WO pairs tend to have shorter intergenic distances, a higher probability of being in the same COG functional category and more similar phylogenetic profiles than TUB pairs, indicating that all three data types are strongly informative for operon prediction. When the method is tested on 236 known operons of Escherichia coli K12, joint learning from the multiple types of genomic data achieves an overall accuracy of 83.8%, whereas the individual information sources yield accuracies of 80.4%, 74.4%, and 70.6%, respectively. We have applied this new approach, in conjunction with our previous comparative genome analysis-based approach, to predict 556 putative operons in WH8102. All predictions are publicly available at http://www.cs.ucr.edu/~xin/operons.htm.
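The two steps described in the abstract, scoring each data type with a log-likelihood and fusing the scores with a perceptron, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the distributions are estimated or how the perceptron is trained, so the histogram binning, the additive pseudocount, the learning rate, and all function names here are assumptions chosen for illustration.

```python
import math


def log_likelihood_ratio(value, wo_values, tub_values, bin_edges, pseudocount=1.0):
    """Return log[P(bin | WO) / P(bin | TUB)] for the bin containing `value`.

    Assumption: distributions are estimated as smoothed histograms over
    user-supplied bin edges; the abstract does not state the estimator.
    """
    n_bins = len(bin_edges) + 1

    def bin_index(v):
        # Index of the bin holding v: the number of edges that v has reached.
        return sum(1 for edge in bin_edges if v >= edge)

    def bin_probability(values, b):
        count = sum(1 for v in values if bin_index(v) == b)
        # Additive smoothing keeps empty bins from producing log(0).
        return (count + pseudocount) / (len(values) + pseudocount * n_bins)

    b = bin_index(value)
    return math.log(bin_probability(wo_values, b) / bin_probability(tub_values, b))


def train_perceptron(samples, labels, epochs=50, learning_rate=0.1):
    """Classic perceptron rule; labels are +1 (WO pair) or -1 (TUB pair).

    Each sample is a vector of per-data-type log-likelihood scores
    (hypothetically: distance, COG function, phylogenetic profile).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified or on the boundary
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 for a predicted WO pair and -1 for a predicted TUB pair."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1
```

In this sketch a gene pair would be scored once per data type with `log_likelihood_ratio`, and the three resulting scores would form the input vector passed to `train_perceptron` and `predict`. A positive score means the observation is more typical of WO pairs than of TUB pairs under the estimated histograms.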