Yoon Hyunsoo, Schwedt Todd J, Chong Catherine D, Olatunde Oyekanmi, Wu Teresa
Department of Industrial Engineering, Yonsei University, Seoul, Republic of Korea.
Department of Neurology, Mayo Clinic, Scottsdale, Arizona, United States of America.
PLoS One. 2024 Dec 31;19(12):e0288300. doi: 10.1371/journal.pone.0288300. eCollection 2024.
Multicenter and multi-scanner imaging studies may be necessary to obtain sample sizes large enough for developing accurate predictive models. However, multicenter studies combine varying research participant characteristics, MRI scanners, and imaging acquisition protocols, which may introduce confounding factors and hinder the creation of generalizable machine learning models. Models developed on one dataset may not readily apply to another, so classification model generalizability is essential for producing reproducible results in multi-scanner and multicenter studies. This study focuses on enhancing generalizability in classifying individual migraine patients and healthy controls from brain MRI data through a data harmonization strategy. We propose identifying a 'healthy core', a group of homogeneous healthy controls with similar characteristics, from multicenter studies. The Maximum Mean Discrepancy (MMD) in Geodesic Flow Kernel (GFK) space is used to compare two datasets, capturing data variability and facilitating the identification of this 'healthy core'. Homogeneous healthy controls play a vital role in mitigating unwanted heterogeneity, enabling the development of highly accurate classification models with improved performance on new datasets. Extensive experimental results underscore the benefits of leveraging a 'healthy core'. We used two datasets: one comprising 120 individuals (66 with migraine and 54 healthy controls) and another comprising 76 individuals (34 with migraine and 42 healthy controls). Notably, a homogeneous dataset derived from a cohort of healthy controls yielded a significant 25% accuracy improvement for both episodic and chronic migraineurs.
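To make the dataset comparison mentioned in the abstract concrete, the following is a minimal sketch of a squared MMD estimate between two feature matrices. It is not the authors' implementation: it substitutes a plain RBF kernel for the Geodesic Flow Kernel, and the names `rbf_kernel`, `mmd2`, `gamma`, and the `site_*` arrays are hypothetical. The arrays are synthetic random numbers standing in for per-subject MRI features, not study data.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb rounding error.
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=0.2):
    """Biased estimate of the squared Maximum Mean Discrepancy.

    Assumption: an RBF kernel is used here for simplicity; the abstract
    describes MMD computed in Geodesic Flow Kernel space instead.
    """
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Synthetic stand-ins for healthy-control feature matrices (rows = subjects).
rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=(54, 5))
site_b = rng.normal(0.0, 1.0, size=(42, 5))  # same distribution as site_a
site_c = rng.normal(2.0, 1.0, size=(42, 5))  # shifted, e.g. a scanner effect

print("MMD^2, similar sites:", mmd2(site_a, site_b))
print("MMD^2, shifted site: ", mmd2(site_a, site_c))
```

Under these assumptions, a smaller value indicates that two groups of healthy controls are more alike in feature space, which is the kind of signal a 'healthy core' selection could rely on.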