Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies.

Affiliation

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America.

Publication information

PLoS Comput Biol. 2023 Oct 16;19(10):e1010608. doi: 10.1371/journal.pcbi.1010608. eCollection 2023 Oct.

DOI:10.1371/journal.pcbi.1010608

PMID:37844077

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10602384/

Abstract

Heterogeneity across genomic studies compromises the performance of machine learning models in cross-study phenotype prediction. Overcoming heterogeneity when incorporating different studies for phenotype prediction is a challenging and critical step in developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of heterogeneities. We developed a comprehensive workflow to simulate different types of heterogeneity and to evaluate the performance of different integration methods together with batch normalization using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies. We showed that heterogeneity across genomic studies can markedly reduce the reproducibility of a machine learning classifier. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the classifier's prediction accuracy can decrease markedly as the underlying disease model becomes more different between the training and test populations. Comparing merging and integration methods, we found that each can outperform the other depending on the scenario. In the realistic applications, we observed that prediction accuracy improved when ComBat normalization was applied with merging or integration methods in both the CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences between studies and batch effects. We also showed that both the merging strategy and the integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance with rank aggregation methods and showed that rank aggregation methods performed similarly to other ensemble learning approaches.
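The two ideas named in the abstract, batch normalization before merging and rank aggregation of classifier outputs, can be illustrated with a minimal NumPy sketch. It is not the authors' workflow and it is not ComBat: `location_scale_adjust` is a hypothetical helper that only z-scores each feature within each study, without ComBat's empirical Bayes shrinkage or covariate handling, and `mean_rank_aggregate` is a hypothetical mean-rank combiner that stands in for the rank aggregation methods the paper evaluates. The simulated studies and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def location_scale_adjust(X, batch):
    """Z-score every feature within each batch (study).

    Simplified stand-in for ComBat: it removes per-batch location and
    scale differences but applies no empirical Bayes shrinkage.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    out = np.empty_like(X)
    for b in np.unique(batch):
        rows = batch == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0)
        sd[sd == 0] = 1.0  # leave constant features unscaled
        out[rows] = (X[rows] - mu) / sd
    return out


def mean_rank_aggregate(scores):
    """Combine per-classifier scores (n_classifiers x n_samples) by mean rank.

    Returns values in [0, 1]; higher means ranked higher on average.
    Ties are broken by sample order, which is a simplification.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort(axis=1).argsort(axis=1)  # 0-based rank per row
    return ranks.mean(axis=0) / (scores.shape[1] - 1)


# Two simulated studies; the second has shifted and rescaled features.
rng = np.random.default_rng(0)
study_a = rng.normal(0.0, 1.0, size=(50, 5))
study_b = rng.normal(3.0, 2.0, size=(40, 5))

# Merge the studies, then adjust using the study label as the batch.
merged = np.vstack([study_a, study_b])
batch = np.array(["A"] * 50 + ["B"] * 40)
adjusted = location_scale_adjust(merged, batch)

# Aggregate the scores of two hypothetical classifiers on three samples.
combined = mean_rank_aggregate([[0.1, 0.9, 0.5], [0.2, 0.8, 0.3]])
```

After adjustment, each study has per-feature mean 0 and standard deviation 1, so a classifier trained on the merged matrix no longer sees the study-level shift. Centering within a study also removes part of the phenotype signal when case and control proportions differ between studies, which is one reason a full method such as ComBat is preferred in practice.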

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2fd6/10602384/6a4fd0ab9c85/pcbi.1010608.g001.jpg
