Carrle Friedrich Philipp, Hollenbenders Yasmin, Reichenbach Alexandra
Center for Machine Learning, Heilbronn University, Heilbronn, Germany.
Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, Germany.
Front Neurosci. 2023 Oct 2;17:1219133. doi: 10.3389/fnins.2023.1219133. eCollection 2023.
Major depressive disorder (MDD) is the most common mental disorder worldwide and impairs quality of life and independence. Electroencephalography (EEG) biomarkers processed with machine learning (ML) algorithms have been explored for objective diagnosis, with promising results. However, the generalizability of these models, a prerequisite for clinical application, is restricted by small datasets. One approach to training ML models that generalize well is to complement the original data with synthetic data produced by generative algorithms. Another advantage of synthetic data is that they can be published for other researchers without risking patient data privacy. Synthetic EEG time-series have not yet been generated for two clinical populations such as MDD patients and healthy controls.
We first reviewed 27 studies presenting EEG data augmentation with generative algorithms for classification tasks, such as diagnosis, to assess the possibilities and shortcomings of recent methods. The subsequent empirical study generated EEG time-series based on two public datasets with 30/28 and 24/29 subjects (MDD/controls). To obtain baseline diagnostic accuracies, convolutional neural networks (CNN) were trained with time-series from each dataset. The data were synthesized with generative adversarial networks (GAN) consisting of CNNs. We evaluated the synthetic data qualitatively and quantitatively and finally used them to re-train the diagnostic model.
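The re-training step can be illustrated with a minimal sketch, assuming the real training epochs and the GAN output are already available as lists of (epoch, label) pairs. The function name `augment_training_set` and the `ratio` parameter are hypothetical and not taken from the study. The sketch only shows how synthetic epochs might be mixed into the training data per class, while the test data stay real.

```python
import random


def augment_training_set(real_train, synthetic, ratio=1.0, seed=0):
    """Return the real training pairs plus a random sample of synthetic pairs.

    real_train, synthetic: lists of (epoch, label) pairs.
    ratio: synthetic epochs added per real epoch, per class (hypothetical knob).
    """
    rng = random.Random(seed)
    augmented = list(real_train)
    for label in sorted({lab for _, lab in real_train}):
        n_real = sum(1 for _, lab in real_train if lab == label)
        pool = [pair for pair in synthetic if pair[1] == label]
        # Never request more synthetic epochs than the generator produced.
        n_add = min(len(pool), int(ratio * n_real))
        augmented.extend(rng.sample(pool, n_add))
    rng.shuffle(augmented)
    return augmented
```

Sampling per class keeps the MDD/control balance of the real data. Evaluating only on held-out real epochs is an assumption of this sketch, chosen so that any accuracy gain reflects generalization to real recordings.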
The reviewed studies improved their classification accuracies by 1 to 40% with the synthetic data. Our own diagnostic accuracy improved by up to 10% for one dataset but not significantly for the other. We found a rich repertoire of generative models in the reviewed literature, solving various technical issues. A major shortcoming in the field is the lack of meaningful evaluation metrics for synthetic data. The few studies analyzing the data in the frequency domain, including our own, show that only some features can be reproduced faithfully.
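One way to make a frequency-domain comparison concrete is to contrast relative band powers of a real and a synthetic epoch. The following is a minimal sketch under stated assumptions, not the evaluation used in the study: the band edges are conventional values that vary between publications, the function names are hypothetical, and a plain discrete Fourier transform stands in for whatever spectral estimator a real analysis would use.

```python
import cmath
import math

# Conventional EEG band edges in Hz; an assumption, not taken from the study.
BANDS = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}


def power_spectrum(signal, fs):
    """Return (frequency in Hz, power) pairs for the non-negative DFT bins."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append((k * fs / n, abs(coeff) ** 2))
    return spectrum


def relative_band_powers(signal, fs):
    """Fraction of the summed band power that falls into each band."""
    spectrum = power_spectrum(signal, fs)
    absolute = {
        name: sum(p for f, p in spectrum if lo <= f < hi)
        for name, (lo, hi) in BANDS.items()
    }
    total = sum(absolute.values())
    return {
        name: (value / total if total > 0 else 0.0)
        for name, value in absolute.items()
    }


def band_power_gaps(real, synthetic, fs):
    """Absolute difference in relative band power between two epochs."""
    r = relative_band_powers(real, fs)
    s = relative_band_powers(synthetic, fs)
    return {name: abs(r[name] - s[name]) for name in BANDS}
```

A gap near zero in one band combined with a large gap in another would be an instance of a generator reproducing only some spectral features. In practice such gaps would be averaged over many epochs and channels rather than read from a single pair.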
The systematic review combined with our own investigation provides an overview of the available methods for generating EEG data for a classification task, along with their possibilities and shortcomings. The approach is promising and the technical basis is set. For broad application of these techniques in neuroscience research or clinical practice, the methods need fine-tuning facilitated by domain expertise in (clinical) EEG research.