Department of Computer Science, East Stroudsburg University, East Stroudsburg, PA 18301, USA.
BMC Genomics. 2010 Nov 2;11 Suppl 2(Suppl 2):S1. doi: 10.1186/1471-2164-11-S2-S1.
Genomic islands (GIs) are clusters of alien genes that are present in some bacterial genomes but absent from the genomes of other strains within the same genus. Detecting GIs is important to the medical and environmental communities. Although several GI-associated features have been discovered, the accuracy of GI detection remains far from satisfactory.
In this paper, we combined multiple GI-associated features and applied and compared various machine learning approaches to evaluate classification accuracy on GI datasets from three genera (Salmonella, Staphylococcus, and Streptococcus) and on a mixed dataset combining all three. The experimental results showed that, in general, the decision tree approach outperformed the other machine learning methods according to five performance evaluation metrics. Using J48 decision trees as base classifiers, we further applied four ensemble algorithms (adaBoost, bagging, multiboost, and random forest) to the same datasets. We found that, overall, these ensemble classifiers could improve classification accuracy.
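As a rough illustration of the ensemble idea described above, the following minimal sketch applies bagging to tree-based base classifiers. It is not the authors' pipeline: one-level decision stumps stand in for J48 trees, the two numeric features and the toy data generator are hypothetical, and all function names are invented for this example.

```python
import random
from collections import Counter

def make_toy_data(n, seed):
    """Hypothetical data: two numeric features per region, label 1 = GI-like."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        # Assumption: GI-like regions have shifted feature values.
        X.append([rng.gauss(1.5 * label, 1.0), rng.gauss(1.0 * label, 1.0)])
        y.append(label)
    return X, y

def fit_stump(X, y):
    """Pick the one-feature threshold split with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for hi_label in (0, 1):
                errors = sum(
                    (hi_label if row[f] > t else 1 - hi_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, hi_label)
    _, f, t, hi_label = best
    return f, t, hi_label

def predict_stump(stump, row):
    f, t, hi_label = stump
    return hi_label if row[f] > t else 1 - hi_label

def fit_bagging(X, y, n_estimators=15, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Majority vote over the base classifiers."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

X_train, y_train = make_toy_data(120, seed=1)
X_test, y_test = make_toy_data(120, seed=2)

single = fit_stump(X_train, y_train)
ensemble = fit_bagging(X_train, y_train)

print("single stump accuracy:", accuracy(lambda r: predict_stump(single, r), X_test, y_test))
print("bagged stumps accuracy:", accuracy(lambda r: predict_bagging(ensemble, r), X_test, y_test))
```

An odd number of base classifiers is used so that a two-class majority vote cannot tie. On toy data like this the ensemble may or may not beat the single stump; the sketch only shows the mechanics of resampling and voting.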
We conclude that decision-tree-based ensemble algorithms can accurately classify GIs and non-GIs, and we recommend these methods for future GI data analysis. The software package for detecting GIs can be accessed at http://www.esu.edu/cpsc/che_lab/software/GIDetector/.