Department of Biomedical Engineering, Yale University, New Haven, CT, USA.
Interdepartmental Neuroscience Program, Yale University, New Haven, CT, USA.
Nat Hum Behav. 2024 Oct;8(10):2018-2033. doi: 10.1038/s41562-024-01931-7. Epub 2024 Jul 31.
Brain-phenotype predictive models seek to identify reproducible and generalizable brain-phenotype associations. External validation, that is, evaluating a model in external datasets, is the gold standard for assessing the generalizability of models in neuroimaging. Unlike typical studies, external validation involves two sample sizes: the training sample size and the external sample size. Thus, traditional power calculations may not be appropriate. Here we ran over 900 million resampling-based simulations in functional and structural connectivity data to investigate the relationships among training sample size, external sample size, phenotype effect size, theoretical power and simulated power. Our analysis covered a wide range of datasets (the Healthy Brain Network, the Adolescent Brain Cognitive Development Study, the Human Connectome Project Development and Young Adult studies, the Philadelphia Neurodevelopmental Cohort, the Queensland Twin Adolescent Brain Project and the Chinese Human Connectome Project) and phenotypes (age, body mass index, matrix reasoning, working memory, attention problems, anxiety/depression symptoms and relational processing). High-effect-size predictions achieved adequate power with training and external sample sizes of a few hundred individuals, whereas low- and medium-effect-size predictions required hundreds to thousands of training and external samples. In addition, most previous external validation studies used sample sizes prone to low power, and theoretical power curves should be adjusted for the training sample size. Furthermore, model performance in internal validation often informed subsequent external validation performance (Pearson's r difference <0.2), particularly for well-harmonized datasets. These results can help determine how to power future external validation studies.
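The interplay between the two sample sizes can be made concrete with a small simulation. The Python sketch below (not the authors' code; the data are synthetic, and the feature count, ridge model and effect size of r = 0.3 are illustrative assumptions) estimates simulated power by repeatedly drawing a training sample and an external sample, fitting a model and testing the external prediction-observation correlation, and compares it with the classic Fisher-z theoretical power, which depends only on the external sample size. Because a model trained on few individuals cannot realize the full effect size, simulated power falls below the naive theoretical curve at small training sizes, illustrating why theoretical power curves must be adjusted for the training sample size.

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import Ridge

    def make_sampler(n_features=100, effect_r=0.3, seed=0):
        """Synthetic 'connectome' features with a phenotype whose true
        association with the feature signal is effect_r."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(n_features)
        w /= np.linalg.norm(w)  # unit-norm weights -> unit-variance signal
        def sample(n):
            X = rng.standard_normal((n, n_features))
            y = effect_r * (X @ w) + np.sqrt(1 - effect_r**2) * rng.standard_normal(n)
            return X, y
        return sample

    def theoretical_power(r, n, alpha=0.05):
        """Fisher-z power for detecting a Pearson correlation r at external
        sample size n (two-sided test; the negligible lower-tail rejection
        region is ignored). Depends on n alone, not on training size."""
        z = np.arctanh(r) * np.sqrt(n - 3)
        return 1 - stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) - z)

    def simulated_power(sample, n_train, n_external, n_iter=200, alpha=0.05):
        """Fraction of resamples in which the external prediction-observation
        correlation is positive and significant."""
        hits = 0
        for _ in range(n_iter):
            X_tr, y_tr = sample(n_train)
            X_ex, y_ex = sample(n_external)
            model = Ridge(alpha=1.0).fit(X_tr, y_tr)
            r, p = stats.pearsonr(model.predict(X_ex), y_ex)
            hits += (p < alpha) and (r > 0)
        return hits / n_iter

    sample = make_sampler(effect_r=0.3)  # a "medium" effect size, by assumption
    for n in (100, 300, 1000):
        print(n, round(theoretical_power(0.3, n), 2),
              simulated_power(sample, n_train=n, n_external=n))

In this toy setting, the gap between the two printed power values shrinks as n grows, mirroring the abstract's finding that high-effect-size predictions reach adequate power with a few hundred training and external participants while smaller effects require far larger samples.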