Combrisson Etienne, Jerbi Karim
DYCOG Lab, Lyon Neuroscience Research Center, INSERM U1028, UMR 5292, University Lyon I, Lyon, France; Center of Research and Innovation in Sport, Mental Processes and Motor Performance, University of Lyon I, Lyon, France.
DYCOG Lab, Lyon Neuroscience Research Center, INSERM U1028, UMR 5292, University Lyon I, Lyon, France; Psychology Department, University of Montreal, QC, Canada.
J Neurosci Methods. 2015 Jul 30;250:126-36. doi: 10.1016/j.jneumeth.2015.01.010. Epub 2015 Jan 14.
Machine learning techniques are increasingly used in neuroscience to classify brain signals. Decoding performance is assessed by how far the classification results depart from the rate achieved by purely random classification. In a 2-class or 4-class classification problem, the theoretical chance levels are thus 50% or 25%, respectively. However, such thresholds hold for an infinite number of data samples but not for small data sets. While this limitation is widely recognized in the machine learning field, it is unfortunately still sometimes overlooked in the emerging field of brain signal classification, a field that is often faced with low sample sizes. In this study we demonstrate how applying signal classification to Gaussian random signals can yield decoding accuracies of 70% or higher in two-class decoding with small sample sets. Most importantly, we provide a thorough quantification of the severity of this limitation and of the parameters affecting it, using simulations in which we manipulate sample size, class number, cross-validation parameters (k-fold, leave-one-out and repetition number) and classifier type (Linear Discriminant Analysis, Naïve Bayes and Support Vector Machine). In addition to raising a red flag of caution, we illustrate the use of analytical and empirical solutions (binomial formula and permutation tests) that tackle the problem by providing statistical significance levels (p-values) for the decoding accuracy, taking sample size into account. Finally, we illustrate the relevance of our simulations and statistical tests on real brain data by assessing noise-level classifications in magnetoencephalography (MEG) and intracranial EEG (iEEG) baseline recordings.
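To make the binomial approach mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each of the n test samples is classified independently and correctly with probability 1/c under the null hypothesis, so the number of correct predictions follows a Binomial(n, 1/c) distribution. The function names `binomial_p_value` and `significance_threshold` are hypothetical and chosen for illustration only.

```python
from math import comb


def binomial_p_value(n_correct, n_samples, n_classes):
    """One-sided p-value P(X >= n_correct) for X ~ Binomial(n_samples, 1/n_classes).

    Assumption: under the null hypothesis every sample is classified
    independently, with a 1/n_classes probability of being correct.
    """
    p = 1.0 / n_classes
    return sum(
        comb(n_samples, k) * p**k * (1.0 - p) ** (n_samples - k)
        for k in range(n_correct, n_samples + 1)
    )


def significance_threshold(n_samples, n_classes, alpha=0.05):
    """Smallest accuracy (as a fraction) whose one-sided p-value is <= alpha.

    Returns None when even a perfect score is not significant at this alpha,
    which can happen for very small sample sizes.
    """
    for n_correct in range(n_samples + 1):
        if binomial_p_value(n_correct, n_samples, n_classes) <= alpha:
            return n_correct / n_samples
    return None


if __name__ == "__main__":
    # With only 20 samples and 2 classes, 75% accuracy is needed at alpha = 0.05,
    # well above the theoretical 50% chance level.
    print(significance_threshold(20, 2, alpha=0.05))  # 0.75
    # The required accuracy moves closer to 50% as the sample size grows.
    print(significance_threshold(100, 2, alpha=0.05))
```

Under these assumptions, an accuracy that looks comfortably above the theoretical chance level can still fall short of significance when the sample is small. A permutation test, the empirical alternative named in the abstract, avoids the independence assumption by re-running the classification on shuffled labels instead.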