Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America.
PLoS Comput Biol. 2023 Oct 16;19(10):e1010608. doi: 10.1371/journal.pcbi.1010608. eCollection 2023 Oct.
Heterogeneity across genomic studies compromises the performance of machine learning models in cross-study phenotype prediction. Overcoming this heterogeneity when combining studies is a challenging and critical step toward developing machine learning algorithms whose prediction performance is reproducible on independent datasets. We investigated the best approaches for integrating different studies of the same type of omics data under a variety of heterogeneities. We developed a comprehensive workflow to simulate different types of heterogeneity and to evaluate the performance of different integration methods together with batch normalization using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies. We showed that heterogeneity across genomic studies can markedly reduce the reproducibility of machine learning classifiers. ComBat normalization improved the prediction performance of the classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the classifier's prediction accuracy can decrease markedly as the underlying disease models of the training and test populations become more different. Comparing merging and integration methods, we found that neither was uniformly better: each could outperform the other in different scenarios. In the realistic applications, prediction accuracy improved when ComBat normalization was applied with merging or integration methods in both the CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences between studies and batch effects. We also showed that both the merging strategy and the integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance with rank aggregation methods and showed that they performed similarly to other ensemble learning approaches.
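The contrast between merging and integration can be illustrated with a minimal Python sketch. It is not the authors' workflow. Everything in it is an assumption made for illustration: the simulated studies, the nearest-centroid classifier, the per-study standardization used as a simplified stand-in for ComBat (which additionally applies empirical Bayes shrinkage), and the mean-rank rule used as one simple form of rank aggregation.

```python
import random
import statistics


def standardize_per_study(X):
    """Center and scale each feature within one study.
    Simplified stand-in for batch normalization; this is NOT ComBat."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]


def fit_centroids(X, y):
    """Toy classifier (hypothetical choice): per-class feature means."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cents[label] = [statistics.fmean(c) for c in zip(*rows)]
    return cents


def score(cents, x):
    """Higher score: sample is closer to the case centroid than to the control one."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, cents[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, cents[1]))
    return d0 - d1


def ranks(values):
    """Rank from 1 (lowest value) to n (highest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def simulate_study(n, shift, rng):
    """Hypothetical study: signal in feature 0, additive batch shift on both features."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced cases and controls
        X.append([rng.gauss(2.0 * label, 1.0) + shift, rng.gauss(0.0, 1.0) + shift])
        y.append(label)
    return X, y


def auc(scores, y):
    """Area under the ROC curve via pairwise case/control comparisons."""
    pos = [s for s, lab in zip(scores, y) if lab == 1]
    neg = [s for s, lab in zip(scores, y) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
train = [simulate_study(100, shift, rng) for shift in (0.0, 3.0, -2.0)]
X_test, y_test = simulate_study(100, 5.0, rng)

# Normalize every study separately (uses no labels, so the test study can be included).
train_norm = [(standardize_per_study(X), y) for X, y in train]
X_test_norm = standardize_per_study(X_test)

# Merging: pool all normalized training studies and fit one classifier.
X_merged = [x for X, _ in train_norm for x in X]
y_merged = [lab for _, y in train_norm for lab in y]
merged_model = fit_centroids(X_merged, y_merged)
merged_scores = [score(merged_model, x) for x in X_test_norm]

# Integration: fit one classifier per study, then average the ranks of their scores.
per_study_ranks = []
for X, y in train_norm:
    model = fit_centroids(X, y)
    per_study_ranks.append(ranks([score(model, x) for x in X_test_norm]))
aggregated = [statistics.fmean(r) for r in zip(*per_study_ranks)]

auc_merged = auc(merged_scores, y_test)
auc_integrated = auc(aggregated, y_test)
print(f"merged AUC: {auc_merged:.3f}  integrated AUC: {auc_integrated:.3f}")
```

Standardizing the test study on its own assumes its case/control balance is roughly comparable to that of the training studies. That assumption is a property of this toy setup, not a recommendation from the paper.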