Contreras-Peruyero Haydeé, Nuñez Imanol, Vazquez-Rosas-Landa Mirna, Santana-Quinteros Daniel, Pashkov Antón, Carranza-Barragán Mario E, Perez-Estrada Rafael, Guerrero-Flores Shaday, Balanzario Eugenio, Muñiz Sánchez Víctor, Nakamura Miguel, Ramírez-Ramírez L Leticia, Sélem-Mojica Nelly
Centro de Ciencias Matemáticas, Universidad Nacional Autónoma de México, Morelia, Mexico.
Centro de Investigación en Matemáticas, A.C., Guanajuato, Mexico.
Front Genet. 2024 Nov 25;15:1449461. doi: 10.3389/fgene.2024.1449461. eCollection 2024.
The Critical Assessment of Massive Data Analysis (CAMDA) addresses the complexities of harnessing Big Data in the life sciences by hosting annual competitions that encourage research groups to develop innovative solutions. In 2023, the Forensic Challenge focused on identifying the city of origin of 365 metagenomic samples collected from public transportation systems and on identifying associations between bacterial distribution and other covariates. For microbiome classification, we used both taxonomic and functional annotations as features. To identify the most informative Operational Taxonomic Units (OTUs), we selected features by fitting negative binomial models. We then trained supervised models under 5-fold cross-validation (CV), which gives a 4:1 training-to-validation ratio in each fold. After variable selection, which reduced the dataset to fewer than 300 OTUs, the Support Vector Classifier achieved the highest F1 score (0.96). With functional features from MIFASER, the Neural Network model outperformed the other models. When considering climatic and demographic variables of the cities, Dirichlet regression over […], […], and […] bacteria abundances suggests that population increase is associated with a rise in the mean of […], while decreasing temperature is linked to higher proportions of […]. This study validates microbiome classification using taxonomic features and, to a lesser extent, functional features. It also shows that demographic and climatic factors influence urban microbial distribution. A Docker container and a Conda environment are available at the repository on GitHub, facilitating broader adoption and validation of these methods by the scientific community.
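The 4:1 training-to-validation ratio follows directly from using five folds: each fold holds out one fifth of the samples and trains on the remaining four fifths. The sketch below illustrates this for 365 samples using only the Python standard library. The function name `k_fold_splits`, the shuffling, and the seed are hypothetical choices made for illustration, not the authors' implementation; the abstract does not say whether folds were stratified by city or whether all 365 samples entered the CV, so neither is assumed here beyond using 365 as the sample count.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for plain k-fold cross-validation.

    Hypothetical helper for illustration only; no stratification is applied.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


splits = list(k_fold_splits(365, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    # 365 / 5 = 73 validation samples and 292 training samples per fold (4:1).
    print(fold, len(train_idx), len(val_idx))
```

Every sample appears in exactly one validation fold, so the per-fold F1 scores can be averaged into a single estimate that is based on all samples.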