Data leakage in deep learning studies of translational EEG.

Authors

Brookshire Geoffrey, Kasper Jake, Blauch Nicholas M, Wu Yunan Charles, Glatt Ryan, Merrill David A, Gerrol Spencer, Yoder Keith J, Quirk Colin, Lucero Ché

Affiliations

SPARK Neuro Inc., New York, NY, United States.

Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, United States.

Publication information

Front Neurosci. 2024 May 3;18:1373515. doi: 10.3389/fnins.2024.1373515. eCollection 2024.

DOI:10.3389/fnins.2024.1373515

PMID:38765672

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11099244/

Abstract

A growing number of studies apply deep neural networks (DNNs) to recordings of human electroencephalography (EEG) to identify a range of disorders. In many studies, EEG recordings are split into segments, and each segment is randomly assigned to the training or test set. As a consequence, data from individual subjects appears in both the training and the test set. Could high test-set accuracy reflect data leakage from subject-specific patterns in the data, rather than patterns that identify a disease? We address this question by testing the performance of DNN classifiers using segment-based holdout (in which segments from one subject can appear in both the training and test set), and comparing this to their performance using subject-based holdout (where all segments from one subject appear exclusively in either the training set or the test set). In two datasets (one classifying Alzheimer's disease, and the other classifying epileptic seizures), we find that performance on previously-unseen subjects is strongly overestimated when models are trained using segment-based holdout. Finally, we survey the literature and find that the majority of translational DNN-EEG studies use segment-based holdout. Most published DNN-EEG studies may dramatically overestimate their classification performance on new subjects.
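The difference between the two holdout schemes described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy subject labels, and the 20% test fraction are assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def segment_based_split(subject_ids, test_fraction=0.2, seed=0):
    """Assign each segment to train or test at random.

    Segments from one subject can land on both sides of the split.
    """
    rng = random.Random(seed)
    indices = list(range(len(subject_ids)))
    rng.shuffle(indices)
    n_test = max(1, round(test_fraction * len(indices)))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def subject_based_split(subject_ids, test_fraction=0.2, seed=0):
    """Assign whole subjects to train or test.

    All segments from a given subject end up on exactly one side.
    """
    rng = random.Random(seed)
    subjects = sorted(set(subject_ids))
    rng.shuffle(subjects)
    n_test = max(1, round(test_fraction * len(subjects)))
    test_subjects = set(subjects[:n_test])
    train = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train, test


def shared_subjects(subject_ids, train, test):
    """Return the subjects that contribute segments to both sets."""
    return {subject_ids[i] for i in train} & {subject_ids[i] for i in test}


# Toy data (hypothetical): 10 subjects with 20 EEG segments each.
# Only the subject label of each segment matters for the split.
subject_ids = [f"S{s:02d}" for s in range(10) for _ in range(20)]

seg_train, seg_test = segment_based_split(subject_ids)
sub_train, sub_test = subject_based_split(subject_ids)

leaky = shared_subjects(subject_ids, seg_train, seg_test)
clean = shared_subjects(subject_ids, sub_train, sub_test)

print(f"segment-based: {len(leaky)} subjects appear in both sets")
print(f"subject-based: {len(clean)} subjects appear in both sets")
```

Under segment-based holdout, a classifier can score well on the test set by recognising subject-specific patterns it already saw during training. Subject-based holdout removes that overlap, so test accuracy reflects performance on previously unseen subjects. In practice, grouped splitters such as scikit-learn's `GroupShuffleSplit` or `GroupKFold` serve the same purpose.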
