Ghasemzadeh Hamzeh, Hillman Robert E, Mehta Daryush D
Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital, Boston.
Department of Surgery, Harvard Medical School, Boston, MA.
J Speech Lang Hear Res. 2024 Mar 11;67(3):753-781. doi: 10.1044/2023_JSLHR-23-00273. Epub 2024 Feb 22.
Many studies using machine learning (ML) in speech, language, and hearing sciences rely upon cross-validation with a single data split. This study's first purpose is to provide quantitative evidence that would incentivize researchers to instead use the more robust data splitting method of nested k-fold cross-validation. The second purpose is to present methods and MATLAB code for performing power analysis for ML-based analyses during study design.
First, the significant impact of different cross-validation methods on ML outcomes was demonstrated using real-world clinical data. Then, Monte Carlo simulations were used to quantify the interactions among the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, the dimensionality of the model, and the sample size. Four cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum sample size required to obtain a statistically significant outcome (5% significance level) with 80% power. Statistical confidence of the model was defined as the probability that the correct features are selected for inclusion in the final model.
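The nested 10-fold scheme compared above can be illustrated with a minimal NumPy-only sketch (the toy data, the k-nearest-neighbour classifier, and the hyperparameter grid are assumptions for illustration, not the authors' MATLAB implementation): an inner 10-fold loop selects the hyperparameter, and an outer 10-fold loop, whose test folds never influence that selection, yields the unbiased accuracy estimate.

```python
# Illustrative nested 10-fold cross-validation (assumed toy setup, not the
# paper's computational model).
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: class means differ along the first feature only.
n, d = 200, 5
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d)) + np.outer(y, np.r_[1.5, np.zeros(d - 1)])

def knn_accuracy(Xtr, ytr, Xte, yte, k):
    """Accuracy of a k-nearest-neighbour majority vote on a held-out fold."""
    dists = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    pred = (ytr[nearest].mean(axis=1) > 0.5).astype(int)
    return float(np.mean(pred == yte))

def folds(n_items, k):
    """Random partition of n_items indices into k folds."""
    return np.array_split(rng.permutation(n_items), k)

outer_scores = []
for test_idx in folds(n, 10):                         # outer loop: estimation
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    Xtr, ytr = X[train_idx], y[train_idx]
    # Inner 10-fold loop: choose k without ever touching the outer test fold.
    best_k, best_acc = None, -1.0
    for k in (1, 3, 5):
        accs = []
        for val_idx in folds(len(train_idx), 10):
            fit_idx = np.setdiff1d(np.arange(len(train_idx)), val_idx)
            accs.append(knn_accuracy(Xtr[fit_idx], ytr[fit_idx],
                                     Xtr[val_idx], ytr[val_idx], k))
        if np.mean(accs) > best_acc:
            best_k, best_acc = k, float(np.mean(accs))
    outer_scores.append(knn_accuracy(Xtr, ytr, X[test_idx], y[test_idx], best_k))

print(round(float(np.mean(outer_scores)), 3))         # unbiased accuracy estimate
```

By contrast, a single holdout split would report one accuracy from one arbitrary partition, which is the source of the optimism and instability the study quantifies.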
ML models generated using the single holdout method had very low statistical power and confidence, leading to overestimation of classification accuracy. Conversely, the nested 10-fold cross-validation method resulted in the highest statistical confidence and power while also providing an unbiased estimate of accuracy. The sample size required by the single holdout method could be 50% higher than what would be needed if nested k-fold cross-validation were used. Statistical confidence in the model based on nested k-fold cross-validation was as much as four times higher than the confidence obtained with the single holdout-based model. A computational model, MATLAB code, and lookup tables are provided to assist researchers with estimating the minimum sample size needed during study design.
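The sample-size logic described above can be sketched with a simplified Monte Carlo power analysis (the chance-level null of 0.5, the assumed true accuracy of 0.7, and the binomial model of correct classifications are illustrative assumptions, not the paper's model): simulate the null and alternative accuracy distributions, take the 5% rejection threshold from the null, and grow the sample size until 80% power is reached.

```python
# Simplified Monte Carlo power analysis for a binary classifier (assumed
# setup for illustration; the paper's computational model is more detailed).
import numpy as np

rng = np.random.default_rng(0)

def power(n, p_alt=0.7, alpha=0.05, trials=10_000):
    """Estimate power to reject chance-level accuracy (0.5) at sample size n."""
    null = rng.binomial(n, 0.5, trials)      # correct counts under H0 (guessing)
    crit = np.quantile(null, 1 - alpha)      # 5%-level rejection threshold
    alt = rng.binomial(n, p_alt, trials)     # correct counts under H1
    return float(np.mean(alt > crit))

n = 10
while power(n) < 0.80:                       # smallest n giving >= 80% power
    n += 1
print(n)
```

A lookup table like the one the authors provide can be generated by sweeping `p_alt` (and, in their fuller model, the cross-validation method and feature-space dimensionality) over this kind of search.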
The adoption of nested k-fold cross-validation is critical for unbiased and robust ML studies in the speech, language, and hearing sciences.