Detect and correct bias in multi-site neuroimaging datasets.

Author information

Lab for Artificial Intelligence in Medical Imaging (AI-Med), Department of Child and Adolescent Psychiatry, University Hospital, LMU München, Germany.

Umeå Center for Functional Brain Imaging, Department of Radiation Sciences, Umeå University.

Publication information

Med Image Anal. 2021 Jan;67:101879. doi: 10.1016/j.media.2020.101879. Epub 2020 Oct 21.

DOI:10.1016/j.media.2020.101879

PMID:33152602

Abstract

The desire to train complex machine learning algorithms and to increase the statistical power in association studies drives neuroimaging research to use ever-larger datasets. The most obvious way to increase sample size is by pooling scans from independent studies. However, simple pooling is often ill-advised as selection, measurement, and confounding biases may creep in and yield spurious correlations. In this work, we combine 35,320 magnetic resonance images of the brain from 17 studies to examine bias in neuroimaging. In the first experiment, Name That Dataset, we provide empirical evidence for the presence of bias by showing that scans can be correctly assigned to their respective dataset with 71.5% accuracy. Given such evidence, we take a closer look at confounding bias, which is often viewed as the main shortcoming in observational studies. In practice, we neither know all potential confounders nor do we have data on them. Hence, we model confounders as unknown, latent variables. Kolmogorov complexity is then used to decide whether the confounded or the causal model provides the simplest factorization of the graphical model. Finally, we present methods for dataset harmonization and study their ability to remove bias in imaging features. In particular, we propose an extension of the recently introduced ComBat algorithm to control for global variation across image features, inspired by adjusting for unknown population stratification in genetics. Our results demonstrate that harmonization can reduce dataset-specific information in image features. Further, confounding bias can be reduced and even turned into a causal relationship. However, harmonization also requires caution as it can easily remove relevant subject-specific information. Code is available at https://github.com/ai-med/Dataset-Bias.
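
The two ideas the abstract leans on, testing whether a scan's dataset can be predicted from its features and harmonizing features to remove that signal, can be illustrated with a small sketch. This is not the authors' pipeline (their code is at the repository linked above). It assumes synthetic Gaussian features, a nearest-centroid classifier standing in for the "Name That Dataset" classifier, and a plain per-site location/scale adjustment standing in for ComBat, without ComBat's empirical Bayes shrinkage or covariate preservation. All function names are hypothetical.

```python
import numpy as np

def make_sites(rng, n_sites=3, n_per_site=200, n_features=5):
    """Simulate features with a site-specific shift and scale (synthetic, assumed)."""
    X, site = [], []
    for s in range(n_sites):
        shift = rng.normal(0.0, 2.0, size=n_features)   # additive site effect
        scale = rng.uniform(0.5, 1.5, size=n_features)  # multiplicative site effect
        X.append(shift + scale * rng.normal(size=(n_per_site, n_features)))
        site.append(np.full(n_per_site, s))
    return np.vstack(X), np.concatenate(site)

def name_that_dataset(X, site, rng):
    """Hold-out accuracy of a nearest-centroid classifier that predicts the site."""
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, test = idx[:half], idx[half:]
    labels = np.unique(site)
    # One centroid per site, estimated on the training half only.
    centroids = np.stack(
        [X[train][site[train] == s].mean(axis=0) for s in labels]
    )
    # Squared Euclidean distance from every test sample to every centroid.
    dists = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = labels[dists.argmin(axis=1)]
    return float((pred == site[test]).mean())

def location_scale_harmonize(X, site):
    """Standardize each site, then map back to the pooled mean and std.

    A simplified stand-in for ComBat: no empirical Bayes, no covariates.
    """
    out = np.empty_like(X, dtype=float)
    pooled_mean, pooled_std = X.mean(axis=0), X.std(axis=0)
    for s in np.unique(site):
        m = site == s
        out[m] = (X[m] - X[m].mean(axis=0)) / X[m].std(axis=0)
    return out * pooled_std + pooled_mean

rng = np.random.default_rng(0)
X, site = make_sites(rng)
acc_before = name_that_dataset(X, site, np.random.default_rng(1))
acc_after = name_that_dataset(
    location_scale_harmonize(X, site), site, np.random.default_rng(1)
)
print(f"site accuracy before: {acc_before:.2f}, after: {acc_after:.2f}")
```

With three equally sized sites, chance level is about one third, so an accuracy near that value after harmonization suggests the site signal visible to this classifier is gone. The sketch also hints at the caution raised in the abstract: because each site is standardized without regard to covariates, any real subject-level difference that happens to align with site would be removed as well.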

Similar articles

Detect and correct bias in multi-site neuroimaging datasets.

Med Image Anal. 2021 Jan;67:101879. doi: 10.1016/j.media.2020.101879. Epub 2020 Oct 21.

ComBat Harmonization: Empirical Bayes versus fully Bayes approaches.

Neuroimage Clin. 2023;39:103472. doi: 10.1016/j.nicl.2023.103472. Epub 2023 Jul 13.

Comparison of traveling-subject and ComBat harmonization methods for assessing structural brain characteristics.

Hum Brain Mapp. 2021 Nov;42(16):5278-5287. doi: 10.1002/hbm.25615. Epub 2021 Aug 17.

Removing the effects of the site in brain imaging machine-learning - Measurement and extendable benchmark.

Neuroimage. 2023 Jan;265:119800. doi: 10.1016/j.neuroimage.2022.119800. Epub 2022 Dec 5.

Effect of data harmonization of multicentric dataset in ASD/TD classification.

Brain Inform. 2023 Nov 25;10(1):32. doi: 10.1186/s40708-023-00210-x.

Mitigating site effects in covariance for machine learning in neuroimaging data.

Hum Brain Mapp. 2022 Mar;43(4):1179-1195. doi: 10.1002/hbm.25688. Epub 2021 Dec 14.

Efficacy of MRI data harmonization in the age of machine learning: a multicenter study across 36 datasets.

Sci Data. 2024 Jan 23;11(1):115. doi: 10.1038/s41597-023-02421-7.

DeepResBat: Deep residual batch harmonization accounting for covariate distribution differences.

Med Image Anal. 2025 Jan;99:103354. doi: 10.1016/j.media.2024.103354. Epub 2024 Sep 21.

Harmonization of resting-state functional MRI data across multiple imaging sites via the separation of site differences into sampling bias and measurement bias.

PLoS Biol. 2019 Apr 18;17(4):e3000042. doi: 10.1371/journal.pbio.3000042. eCollection 2019 Apr.

DeepResBat: deep residual batch harmonization accounting for covariate distribution differences.

bioRxiv. 2024 Aug 6:2024.01.18.574145. doi: 10.1101/2024.01.18.574145.

Cited by

When no answer is better than a wrong answer: A causal perspective on batch effects.

Imaging Neurosci (Camb). 2025 Jan 29;3. doi: 10.1162/imag_a_00458. eCollection 2025.

A natural language processing approach to support biomedical data harmonization: Leveraging large language models.

PLoS One. 2025 Jul 24;20(7):e0328262. doi: 10.1371/journal.pone.0328262. eCollection 2025.

Neuroimaging-based data-driven subtypes of spatiotemporal atrophy due to Parkinson's disease.

Brain Commun. 2025 Apr 16;7(2):fcaf146. doi: 10.1093/braincomms/fcaf146. eCollection 2025.

Limitations of nomogram models in predicting survival outcomes for glioma patients.

Front Immunol. 2025 Mar 18;16:1547506. doi: 10.3389/fimmu.2025.1547506. eCollection 2025.

Development and validation of a machine learning model to predict cognitive behavioral therapy outcome in obsessive-compulsive disorder using clinical and neuroimaging data.

medRxiv. 2025 Feb 14:2025.02.14.25322265. doi: 10.1101/2025.02.14.25322265.

Continuous Monitoring Enables Dynamic Biomarkers to Assess Resilience in Acute COVID-19 Patients.

J Clin Med. 2025 Feb 2;14(3):951. doi: 10.3390/jcm14030951.

From Serendipity to Precision: Integrating AI, Multi-Omics, and Human-Specific Models for Personalized Neuropsychiatric Care.

Biomedicines. 2025 Jan 12;13(1):167. doi: 10.3390/biomedicines13010167.

Interpretable and integrative deep learning for discovering brain-behaviour associations.

Sci Rep. 2025 Jan 17;15(1):2312. doi: 10.1038/s41598-024-85032-5.

A lightweight generative model for interpretable subject-level prediction.

Med Image Anal. 2025 Apr;101:103436. doi: 10.1016/j.media.2024.103436. Epub 2024 Dec 27.

Disentangled latent energy-based style translation: An image-level structural MRI harmonization framework.

Neural Netw. 2025 Apr;184:107039. doi: 10.1016/j.neunet.2024.107039. Epub 2024 Dec 16.
