Yuan Bo, Wang Shulei
Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
Nat Commun. 2025 Sep 1;16(1):8147. doi: 10.1038/s41467-025-63425-y.
Data integration is a powerful tool for building a comprehensive and generalizable understanding of microbial communities and their associations with outcomes of interest. However, integrating data sets from different studies remains challenging because of severe batch effects, unobserved confounding variables, and high heterogeneity across data sets. We propose a new data integration method called MetaDICT, which first estimates the batch effects using weighting methods from the causal inference literature and then refines the estimates via novel shared dictionary learning. Compared with existing methods, MetaDICT better avoids overcorrection of batch effects and preserves biological variation when there are unobserved confounding variables, when data sets are highly heterogeneous across studies, or when the batch is completely confounded with some covariates. Furthermore, MetaDICT generates comparable embeddings at both the taxon and sample levels, which can be used to unravel the hidden structure of the integrated data and improve integrative analysis. Applications to synthetic and real microbiome data sets demonstrate the robustness and effectiveness of MetaDICT in integrative analysis. Using MetaDICT, we characterize microbial interactions, identify generalizable microbial signatures, and improve the accuracy of outcome prediction in two real integrative studies: an integrative analysis of colorectal cancer metagenomics studies and a meta-analysis of immunotherapy microbiome studies.
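To make the idea of a shared dictionary with comparable sample-level embeddings more concrete, the following is a minimal sketch and not the MetaDICT algorithm. It swaps in a plain truncated SVD for the dictionary learning step, skips the weighting-based batch effect estimation entirely, and uses noise-free simulated data. The function name `shared_dictionary`, the matrix layout (taxa in rows, samples in columns), and all sizes are assumptions made for illustration.

```python
import numpy as np


def shared_dictionary(studies, k):
    """Learn one orthonormal dictionary (taxa x k) shared by all studies.

    Assumption: each study is a taxa x samples array whose rows are
    aligned to the same taxa. A truncated SVD stands in for the
    dictionary learning step; this is not the authors' estimator.
    """
    stacked = np.hstack(studies)  # taxa x (total number of samples)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    D = U[:, :k]  # shared dictionary atoms (orthonormal columns)
    # Project every study onto the same atoms, so embeddings from
    # different studies share one coordinate system.
    embeddings = [D.T @ X for X in studies]
    return D, embeddings


# Simulated example: two studies generated from the same hidden atoms.
rng = np.random.default_rng(0)
n_taxa, k = 30, 3
D_true = rng.normal(size=(n_taxa, k))
X1 = D_true @ rng.normal(size=(k, 20))  # study 1: 20 samples
X2 = D_true @ rng.normal(size=(k, 35))  # study 2: 35 samples

D, (E1, E2) = shared_dictionary([X1, X2], k)

# Largest reconstruction error over the two studies.
recon_error = max(np.linalg.norm(X - D @ E) for X, E in [(X1, E1), (X2, E2)])
```

Because both embedding matrices are expressed in the columns of the same `D`, samples from the two studies can be compared or clustered together. A real analysis would also need to handle batch effects, noise, and compositional count data, none of which this sketch addresses.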