Anyaso-Samuel Samuel, Sachdeva Archie, Guha Subharup, Datta Somnath
Department of Biostatistics, University of Florida, Gainesville, FL, United States.
Front Genet. 2021 Apr 20;12:642282. doi: 10.3389/fgene.2021.642282. eCollection 2021.
Microbiome samples collected from urban environments can help predict the geographic origin of unknown samples. If different cities carry distinct microbial signatures, those signatures can be used to predict a sample's location from city-specific microbiome data. We implemented this idea by first pre-processing the raw metagenomic samples provided by the CAMDA organizers with standard bioinformatics procedures. We then trained several component classifiers and a robust ensemble classifier on data generated by taxonomy-dependent and taxonomy-free approaches. We also applied class weighting and an optimal oversampling technique to overcome the class imbalance in the primary data. In each instance, the component classifiers performed differently, whereas the ensemble classifier consistently yielded optimal performance. Finally, we predicted the source cities of mystery samples provided by the organizers. Our results highlight the unreliability of relying on a single classification algorithm to assign metagenomic samples to their source origins. By combining several component classifiers via the ensemble approach, we obtained classification results that were as good as those of the best-performing component classifier.
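The three ingredients named in the abstract (class weighting, oversampling, and combining component classifiers) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the abstract does not specify the weighting formula, the "optimal" oversampling technique, or the ensemble rule, so this sketch swaps in inverse-frequency weights, plain random oversampling, and a simple majority vote. The function names and the labels `city_A` and `city_B` are hypothetical.

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed formula for illustration; rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class.

    Plain random oversampling, a stand-in for the unspecified
    "optimal oversampling technique" mentioned in the abstract.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for c, m in counts.items():
        pool = [x for x, label in zip(X, y) if label == c]
        for _ in range(target - m):
            X_out.append(rng.choice(pool))
            y_out.append(c)
    return X_out, y_out


def majority_vote(predictions):
    """Combine predictions from several classifiers.

    `predictions` holds one list of labels per classifier, all of the
    same length; the most frequent label wins for each sample.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


if __name__ == "__main__":
    # Hypothetical imbalanced toy data: three samples from one city, one from another.
    y = ["city_A", "city_A", "city_A", "city_B"]
    X = [[0.1], [0.2], [0.3], [0.9]]

    print(class_weights(y))
    X_bal, y_bal = random_oversample(X, y)
    print(Counter(y_bal))

    # Label predictions from three hypothetical component classifiers on two samples.
    preds = [["city_A", "city_B"], ["city_A", "city_A"], ["city_B", "city_B"]]
    print(majority_vote(preds))
```

Class weighting leaves the data unchanged and instead makes errors on rare classes cost more during training, whereas oversampling changes the training set itself. A majority vote is only the simplest way to combine classifiers and says nothing about how the ensemble in the paper weighs or selects its components.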