Scientific and Statistical Computing Core, National Institute of Mental Health, USA.
Section on Development and Affective Neuroscience, National Institute of Mental Health, USA.
Neuroimage. 2021 Dec 15;245:118647. doi: 10.1016/j.neuroimage.2021.118647. Epub 2021 Oct 22.
Test-retest reliability indexes the consistency of a measurement across time. High reliability is critical for any scientific study, but especially for the study of individual differences. Evidence of poor reliability in commonly used behavioral and functional neuroimaging tasks is mounting. Reports of low reliability in task-based fMRI have called into question whether even the most common, well-characterized cognitive tasks with robust population-level effects are adequate for measuring individual differences. Here, we lay out a hierarchical framework that estimates reliability as a correlation divorced from trial-level variability, and show that reliability tends to be underestimated under the conventional intraclass correlation framework, which relies on summary statistics from condition-level modeling. In addition, we examine how reliability estimates from the two statistical frameworks diverge and assess how different factors (e.g., trial and subject sample sizes, relative magnitude of cross-trial variability) affect those estimates. Because empirical data indicate that cross-trial variability is large in most tasks, this work highlights that a large number of trials (e.g., more than 100) may be required to achieve precise reliability estimates. We reference the tools TRR and 3dLMEr, with which the community can apply trial-level models to behavioral and neuroimaging data, and discuss how to make these new measurements most useful for future studies.
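The abstract's central claim, that condition-level summaries underestimate reliability when cross-trial variability is large, can be illustrated with a small simulation. The sketch below is not the hierarchical model from the paper and does not use TRR or 3dLMEr. It assumes a deliberately simplified setting: one condition, two sessions, subject-level effects drawn from a bivariate normal with true correlation `rho`, and independent Gaussian trial-level noise. All function names, parameter names, and default values are hypothetical choices made for illustration. Under these assumptions, the correlation between per-session trial averages should shrink toward `rho * sigma_subject**2 / (sigma_subject**2 + sigma_trial**2 / n_trials)`.

```python
import numpy as np


def simulate_condition_level_trr(n_subjects, n_trials, rho=0.8,
                                 sigma_subject=1.0, sigma_trial=5.0, seed=0):
    """Correlate per-session trial averages across subjects (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Subject-level effects for sessions 1 and 2, with true correlation rho.
    cov = sigma_subject ** 2 * np.array([[1.0, rho], [rho, 1.0]])
    true_effects = rng.multivariate_normal([0.0, 0.0], cov, size=n_subjects)
    # Trial-level noise, shape (n_subjects, 2 sessions, n_trials).
    noise = rng.normal(0.0, sigma_trial, size=(n_subjects, 2, n_trials))
    trial_data = true_effects[:, :, np.newaxis] + noise
    # Condition-level summary: average over trials within each session.
    session_means = trial_data.mean(axis=2)
    return np.corrcoef(session_means[:, 0], session_means[:, 1])[0, 1]


def expected_attenuated_trr(n_trials, rho=0.8, sigma_subject=1.0, sigma_trial=5.0):
    """Correlation of trial averages implied by the assumed variance components."""
    return rho * sigma_subject ** 2 / (sigma_subject ** 2 + sigma_trial ** 2 / n_trials)


if __name__ == "__main__":
    for n_trials in (20, 100, 400):
        observed = simulate_condition_level_trr(n_subjects=5000, n_trials=n_trials)
        expected = expected_attenuated_trr(n_trials)
        print(f"trials={n_trials:4d}  observed r={observed:.3f}  expected r={expected:.3f}")
```

In this toy setting the trial-averaged correlation approaches the assumed true value of `rho` only as the trial count grows, which matches the abstract's point that many trials may be needed when cross-trial variability is large relative to cross-subject variability. A trial-level hierarchical model, as described in the abstract, instead estimates the correlation with the trial-level variance modeled separately.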