Mejia Amanda F, Nebel Mary Beth, Eloyan Ani, Caffo Brian, Lindquist Martin A
Department of Statistics, Indiana University, Bloomington, IN, USA.
Center for Neurodevelopmental and Imaging Research, Kennedy Krieger Institute, Baltimore, MD, USA.
Biostatistics. 2017 Jul 1;18(3):521-536. doi: 10.1093/biostatistics/kxw050.
Outlier detection for high-dimensional (HD) data is a popular topic in modern statistical research. However, one source of HD data that has received relatively little attention is functional magnetic resonance imaging (fMRI) data, which consist of hundreds of thousands of measurements sampled at hundreds of time points. At a time when the availability of fMRI data is rapidly growing, primarily through large, publicly available grassroots datasets, automated quality control and outlier detection methods are greatly needed. We propose principal components analysis (PCA) leverage and demonstrate how it can be used to identify outlying time points in an fMRI run. Furthermore, PCA leverage is a measure of the influence of each observation on the estimation of principal components, which are often of interest in fMRI data. We also propose an alternative measure, PCA robust distance, which is less sensitive to outliers and has controllable statistical properties. The proposed methods are validated through simulation studies and are shown to be highly accurate. We also conduct a reliability study using resting-state fMRI data from the Autism Brain Imaging Data Exchange and find that removing outliers with the proposed methods results in more reliable estimation of subject-level resting-state networks using independent components analysis.
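As a rough illustration of the leverage idea described above, the following Python sketch computes a leverage score for each time point from the leading left singular vectors of a centered time-by-voxel matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `pca_leverage` and `flag_outliers`, the number of retained components, the median-based cutoff with `factor=3.0`, and the simulated data are all hypothetical choices made for the example.

```python
import numpy as np

def pca_leverage(Y, n_components):
    """Leverage of each row (time point) of Y with respect to the top PCs.

    Y is a (T, V) array: T time points by V voxels (assumed layout).
    Leverage is the diagonal of U_q U_q^T, where U_q holds the first
    n_components left singular vectors of the column-centered data.
    """
    Yc = Y - Y.mean(axis=0, keepdims=True)           # center each voxel's series
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    Uq = U[:, :n_components]
    return np.sum(Uq ** 2, axis=1)                   # one value per time point

def flag_outliers(leverage, factor=3.0):
    """Flag time points whose leverage exceeds factor * median leverage.

    The cutoff rule and factor are illustrative assumptions.
    """
    return leverage > factor * np.median(leverage)

# Simulated example (hypothetical data, not real fMRI).
rng = np.random.default_rng(0)
T, V, Q = 200, 1000, 5
Y = rng.standard_normal((T, V))
Y[50] += 5.0                                         # inject an artifact at time point 50

lev = pca_leverage(Y, Q)
flags = flag_outliers(lev)
print("most extreme time point:", int(np.argmax(lev)))
print("number flagged:", int(flags.sum()))
```

Because the columns of the retained singular-vector matrix are orthonormal, the leverage values sum to the number of retained components, so the average leverage is Q/T; time points far above that level are the ones that dominate the estimated components.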