Langille Morgan G I, Hsiao William W L, Brinkman Fiona S L
Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada.
BMC Bioinformatics. 2008 Aug 5;9:329. doi: 10.1186/1471-2105-9-329.
Genomic islands (GIs) are clusters of genes in prokaryotic genomes of probable horizontal origin. GIs are disproportionately associated with microbial adaptations of medical or environmental interest. Recently, multiple programs for automated detection of GIs have been developed that utilize sequence composition characteristics, such as G+C ratio and dinucleotide bias. To robustly evaluate the accuracy of such methods, we propose that a dataset of GIs be constructed using criteria that are independent of sequence composition-based analysis approaches.
We developed a comparative genomics approach (IslandPick) that identifies both very probable islands and non-island regions. The approach involves 1) flexible, automated selection of comparative genomes for each query genome, using a distance function that picks appropriate genomes for identification of GIs, 2) identification of regions unique to the query genome, compared with the chosen genomes (positive dataset) and 3) identification of regions conserved across all genomes (negative dataset). Using our constructed datasets, we investigated the accuracy of several sequence composition-based GI prediction tools.
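Steps 2 and 3 above can be illustrated with simple interval arithmetic. The following is a minimal sketch, not the IslandPick implementation: it assumes that whole-genome alignments have already been reduced to half-open intervals on the query genome, one list per comparison genome. The function names and the `min_length` parameter are hypothetical, and step 1 (distance-based genome selection) is not covered.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates on the query genome (an assumed representation).
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals: List[Interval], genome_length: int) -> List[Interval]:
    """Return the parts of [0, genome_length) not covered by any interval."""
    gaps: List[Interval] = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        gaps.append((cursor, genome_length))
    return gaps


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval lists."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def unique_and_conserved(
    alignments_per_genome: List[List[Interval]],
    genome_length: int,
    min_length: int = 1,  # hypothetical size cutoff, not taken from the paper
) -> Tuple[List[Interval], List[Interval]]:
    """Split the query genome into candidate positive and negative regions.

    Positive: query regions aligned to none of the comparison genomes.
    Negative: query regions aligned to every comparison genome.
    """
    all_aligned = [iv for alns in alignments_per_genome for iv in alns]
    unique = complement(all_aligned, genome_length)

    conserved: List[Interval] = [(0, genome_length)]
    for alns in alignments_per_genome:
        conserved = intersect(conserved, alns)

    def keep(ivs: List[Interval]) -> List[Interval]:
        return [(s, e) for s, e in ivs if e - s >= min_length]

    return keep(unique), keep(conserved)
```

For example, with a query genome of length 100 and two comparison genomes aligned at `[(0, 30), (50, 100)]` and `[(0, 40), (60, 100)]`, the unique region is `(40, 50)` and the conserved regions are `(0, 30)` and `(60, 100)`. Regions aligned to only some of the comparison genomes fall into neither dataset, which matches the intent of keeping only very probable islands and non-islands.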
Our results indicate that AlienHunter has the highest recall but the lowest measured precision, while SIGI-HMM is the most precise method. SIGI-HMM and IslandPath/DIMOB share the highest overall accuracy, at comparable levels. When compared against a curated list of GIs, our comparative genomics approach, IslandPick, was the most accurate, indicating that we have constructed suitable datasets. This represents the first evaluation of the accuracy of several sequence composition-based GI predictors using diverse, independent datasets that were not artificially constructed. The caveats associated with this analysis and proposals for optimal island prediction are discussed.
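To make the reported metrics concrete, the sketch below computes precision, recall, and overall accuracy for one predictor against a positive and a negative dataset. It assumes a nucleotide-level scoring scheme in which predicted bases are counted only where they overlap one of the two datasets; the abstract does not state how the metrics were computed, so this scheme and all function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Half-open [start, end) genome coordinates (an assumed representation).
Interval = Tuple[int, int]


def total_length(intervals: List[Interval]) -> int:
    """Total number of bases covered, assuming the intervals do not overlap."""
    return sum(end - start for start, end in intervals)


def overlap_length(a: List[Interval], b: List[Interval]) -> int:
    """Number of bases shared by two lists of non-overlapping intervals."""
    total = 0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0, min(e1, e2) - max(s1, s2))
    return total


def evaluate(
    predicted: List[Interval],
    positive: List[Interval],
    negative: List[Interval],
) -> Dict[str, float]:
    """Score predictions at the base level against both datasets."""
    tp = overlap_length(predicted, positive)  # island bases correctly predicted
    fp = overlap_length(predicted, negative)  # conserved bases wrongly predicted
    fn = total_length(positive) - tp          # island bases missed
    tn = total_length(negative) - fp          # conserved bases correctly left out
    scored = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / scored if scored else 0.0,
    }
```

Under this scheme a tool that predicts very large regions can reach high recall while losing precision, because its predictions spill into the negative dataset. Predicted bases outside both datasets are ignored, since their true status is unknown.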