Department of Psychiatry and Behavioral Sciences, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA.
Neuroimage. 2011 May 15;56(2):517-24. doi: 10.1016/j.neuroimage.2010.05.065. Epub 2010 Jun 25.
Machine learning methods have been applied to classifying fMRI scans by studying locations in the brain that exhibit temporal intensity variation between groups, frequently reporting classification accuracies of 90% or better. Although empirical results are quite favorable, one might doubt the robustness of classification methods to changes in task ordering, question the reproducibility of activation patterns across runs, and ask how much of the classifiers' power is due to artifactual noise rather than genuine neurological signal. To examine the true strength of machine learning classifiers we create and then deconstruct a classifier, examining its sensitivity to physiological noise and task reordering and its ability to classify across scans. The models are trained and tested both within and across runs to assess stability and reproducibility across conditions. We demonstrate the use of independent components analysis for both feature extraction and artifact removal, and show that removing such artifacts can reduce predictive accuracy even when the data have been cleaned in the preprocessing stages. We demonstrate how mistakes in the feature selection process can cause the cross-validation error reported in publications to be a biased estimate of the testing error seen in practice, and we measure this bias by purposely building flawed models. We discuss other ways bias can be introduced and the statistical assumptions underlying the data and the models themselves. Finally, we discuss the complications of drawing inference from the small sample sizes typical of fMRI studies, the effects of small or unbalanced samples on Type I and Type II error rates, and how publication bias can instill false confidence in the power of such methods. Collectively, this work identifies challenges specific to fMRI classification and the methodological choices that affect model stability.
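The ICA-based denoising the abstract evaluates can be sketched briefly. The snippet below is a minimal illustration, not the authors' pipeline: the synthetic task and cardiac time courses, the component count, the correlation-based artifact flag, and the use of scikit-learn's FastICA are all assumptions made for demonstration.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 60, 600)                       # 600 time points over 60 s
task = np.sign(np.sin(2 * np.pi * t / 20))        # block-design "signal"
cardiac = np.sin(2 * np.pi * 1.1 * t)             # ~1.1 Hz "cardiac" artifact
mixing = rng.standard_normal((100, 2))            # 100 synthetic voxels
X = np.column_stack([task, cardiac]) @ mixing.T   # observed voxel time series
X += 0.1 * rng.standard_normal(X.shape)

ica = FastICA(n_components=2, random_state=0)
S = ica.fit_transform(X)                          # component time courses

# Flag the artifact component by its correlation with the cardiac reference;
# real pipelines use spatial maps, spectra, or physiological recordings.
r = [abs(np.corrcoef(S[:, k], cardiac)[0, 1]) for k in range(S.shape[1])]
S_clean = S.copy()
S_clean[:, int(np.argmax(r))] = 0                 # zero out the artifact source
X_clean = ica.inverse_transform(S_clean)          # reconstruct denoised data

The components serve double duty, as in the abstract: the retained time courses are candidate classification features, while the zeroed ones implement the artifact removal whose effect on predictive accuracy the paper measures.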
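The feature-selection flaw described in the abstract is concrete enough to demonstrate directly. In this sketch (again an illustrative scikit-learn example, not the paper's code), screening 5000 pure-noise features against the labels of all 40 samples before cross-validating produces an accuracy estimate far above chance, while nesting the screening inside each training fold returns the chance-level estimate a practitioner would actually see on new data.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5000))   # 40 "scans", 5000 noise "voxels"
y = np.repeat([0, 1], 20)             # two balanced groups, no real signal

# Flawed: screen features against the labels using ALL samples, then CV.
top = SelectKBest(f_classif, k=50).fit(X, y).get_support()
biased = cross_val_score(LinearSVC(), X[:, top], y, cv=5).mean()

# Correct: nest the screening inside each cross-validation training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=50), LinearSVC())
unbiased = cross_val_score(pipe, X, y, cv=5).mean()

print(f"biased CV accuracy:   {biased:.2f}")   # typically well above chance
print(f"unbiased CV accuracy: {unbiased:.2f}") # near 0.50, as it should be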