Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA.
Bioinformatics. 2010 Aug 15;26(16):2051-2. doi: 10.1093/bioinformatics/btq299. Epub 2010 Jun 10.
A huge amount of metagenomic sequence data has been produced as a result of rapidly increasing efforts worldwide to study microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for predicting genomic elements such as genes differ significantly between chromosomes and plasmids, which raises the need to separate chromosomal from plasmid sequences in a metagenome. We present a program that classifies the sequences of a metagenome as chromosomal or plasmid, based on their distinguishing pentamer frequencies. On a large training set consisting of all the sequenced prokaryotic chromosomes and plasmids, the program achieves approximately 92% classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, its classification accuracy ranges from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy.
The program has been implemented as a standalone prediction program, cBar, which is available at http://csbl.bmb.uga.edu/~ffzhou/cBar.
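To make the notion of pentamer frequencies concrete, the following minimal Python sketch computes a 1024-dimensional frequency vector for a single sequence. It is an illustration under stated assumptions and is not taken from cBar: the function name `pentamer_frequencies`, the use of overlapping windows, the skipping of windows that contain ambiguous bases, and the normalization by the number of valid windows are all choices made for this example. The abstract does not say how cBar counts pentamers or which classifier it applies to them.

```python
from itertools import product

K = 5  # pentamer length
ALPHABET = "ACGT"

# All 4**5 = 1024 possible pentamers, in a fixed lexicographic order.
PENTAMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(PENTAMERS)}


def pentamer_frequencies(sequence):
    """Return the relative frequency of each pentamer in `sequence`.

    Assumptions for this sketch (not confirmed by the source):
    overlapping windows are counted, windows containing any
    character other than A, C, G or T are skipped, and counts are
    divided by the number of valid windows.
    """
    seq = sequence.upper()
    counts = [0] * len(PENTAMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:  # skip windows with ambiguous bases such as N
            counts[idx] += 1
            total += 1
    if total == 0:
        # Sequence too short or entirely ambiguous: return an all-zero vector.
        return [0.0] * len(PENTAMERS)
    return [c / total for c in counts]


if __name__ == "__main__":
    freqs = pentamer_frequencies("ACGTACGTAC")
    print(len(freqs), freqs[INDEX["ACGTA"]])
```

A vector of this kind could serve as the input features for any standard classifier that separates chromosomal from plasmid fragments. Shorter fragments yield fewer windows and therefore noisier frequency estimates, which is consistent with the lower accuracy reported for short sequences.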