Díez López Celia, Montiel González Diego, Vidaki Athina, Kayser Manfred
Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, Netherlands.
Front Microbiol. 2022 Jul 19;13:886201. doi: 10.3389/fmicb.2022.886201. eCollection 2022.
Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and other fields. One of these applications is the prediction of human traits, for which machine learning (ML) methods are often employed but face practical challenges. Class imbalance in available microbiome data is one of the major problems; if unaccounted for, it leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques, which account for class imbalance, with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata, which showed a serious class imbalance, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and with baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation with a linear-kernel support vector machine and achieved a Matthews correlation coefficient of 0.36 and an AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
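To illustrate why accuracy alone can look spuriously good under the class imbalance reported above, the following minimal sketch compares accuracy and the Matthews correlation coefficient (MCC) for a hypothetical classifier that labels every sample as a non-current smoker. Only the class sizes (175 and 1,070) come from the abstract; the classifier, the function names, and the convention of returning 0 when the MCC denominator is zero are assumptions made for illustration and are not the authors' pipeline.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when a marginal is empty (zero denominator).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Class sizes reported in the abstract.
N_SMOKERS, N_NON_SMOKERS = 175, 1070

# Hypothetical classifier that predicts "non-current smoker" for everyone:
# no smoker is detected, and every non-smoker is trivially correct.
majority = dict(tp=0, fp=0, tn=N_NON_SMOKERS, fn=N_SMOKERS)

print(f"accuracy = {accuracy(**majority):.3f}")  # accuracy = 0.859
print(f"MCC      = {mcc(**majority):.3f}")       # MCC      = 0.000
```

Under these assumptions the uninformative classifier reaches about 86% accuracy while its MCC is 0, which is one reason a balance-aware metric such as MCC is a more informative summary for data like these.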