Saqib Mohammed, Horovitz Silvina G
University of Pennsylvania, Philadelphia, PA 19104, USA.
National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA.
NeuroSci. 2024 Nov 29;5(4):600-613. doi: 10.3390/neurosci5040042.
Classification of disease and healthy volunteer cohorts provides a useful clinical alternative to traditional group statistics because it yields individualized predictions. Classifiers for neurodegenerative disease can be trained on structural MRI morphometry, but they require large multi-scanner datasets, which introduce confounding batch effects. We test ComBat, a common harmonization model, in an example application that classifies subjects with Parkinson's disease against healthy volunteers, and we identify common pitfalls, including data leakage. We used a multi-dataset cohort of 372 subjects (216 with Parkinson's disease, 156 healthy volunteers) from 11 identified scanners. We extracted both FreeSurfer and Jacobian determinant morphometry to compare single-scanner and multi-scanner classification pipelines. We confirm the presence of batch effects by running single-scanner classifiers, which achieved widely divergent AUCs on scanner-specific datasets (mean: 0.651 ± 0.144). Multi-scanner classifiers that considered neurobiological batch effects between sites could easily achieve a test AUC of 0.902, whereas pipelines that prevented data leakage could only achieve a test AUC of 0.550. We conclude that batch effects remain a major issue for classification problems, such that even impressive single-scanner classifiers are unlikely to generalize to multiple scanners, and that any correction for batch effects in a classification problem must avoid circularity and the reporting of overly optimistic results.
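The data leakage pitfall named above can be illustrated with a minimal sketch. It is not the authors' pipeline and it is not ComBat: it swaps in a much simpler per-scanner location and scale adjustment on a single feature, and the function names (`fit_scanner_stats`, `harmonize`) and toy values are hypothetical. The point it shows is only the ordering: harmonization parameters are estimated from the training split alone and then applied unchanged to the held-out split.

```python
from statistics import mean, pstdev

def fit_scanner_stats(values, scanners):
    """Estimate a per-scanner mean and standard deviation (training data only)."""
    stats = {}
    for scanner in sorted(set(scanners)):
        vals = [v for v, s in zip(values, scanners) if s == scanner]
        sd = pstdev(vals)
        # Guard against a zero spread so the later division stays defined.
        stats[scanner] = (mean(vals), sd if sd > 0 else 1.0)
    return stats

def harmonize(values, scanners, stats):
    """Center and scale each value with the statistics stored for its scanner."""
    return [(v - stats[s][0]) / stats[s][1] for v, s in zip(values, scanners)]

# Toy single-feature data from two hypothetical scanners, "A" and "B".
train_values = [1.0, 3.0, 10.0, 14.0]
train_scanners = ["A", "A", "B", "B"]
test_values = [2.0, 16.0]
test_scanners = ["A", "B"]

# Leak-free ordering: fit on the training split, then reuse those parameters.
stats = fit_scanner_stats(train_values, train_scanners)
train_h = harmonize(train_values, train_scanners, stats)
test_h = harmonize(test_values, test_scanners, stats)
```

Fitting the statistics on the pooled training and test values before splitting would let held-out subjects influence the transformation, which is the kind of circularity the abstract warns against. This sketch also assumes every test scanner appears in the training split; an unseen scanner would raise a `KeyError`.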