Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.

Author information

Varotto Giulia, Susi Gianluca, Tassi Laura, Gozzo Francesca, Franceschetti Silvana, Panzica Ferruccio

Affiliations

Epilepsy Unit, Bioengineering Group, Fondazione IRCCS Istituto Neurologico Carlo Besta, Milan, Italy.

Neurophysiopathology Unit, Fondazione IRCCS Istituto Neurologico Carlo Besta, Milan, Italy.

Publication information

Front Neuroinform. 2021 Nov 19;15:715421. doi: 10.3389/fninf.2021.715421. eCollection 2021.

Abstract

In neuroscience research, data are often imbalanced between the majority and minority classes, an issue that can limit or even worsen the predictive performance of machine learning methods. Different resampling procedures have been developed to address this problem, and much work has compared their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested across a wide variety of datasets, without considering performance on each specific dataset. In this study, we compare the performance of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsy who underwent surgery. We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsy, for a supervised classification problem aimed at distinguishing between epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare performance, the area under the ROC curve (AUC), F-measure, geometric mean, and balanced accuracy were considered. Both oversampling and undersampling improved performance with respect to the original dataset. Oversampling was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performance. All the undersampling approaches were more robust than oversampling across the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.
The application of machine learning techniques that take class balance into account through resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method used together with the resampling to maximize the benefit to the outcome.
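
To make the simplest procedure named in the abstract concrete, the sketch below shows random undersampling (RUS) and three of the four evaluation metrics (F-measure, geometric mean, balanced accuracy) in plain Python. It is not the authors' pipeline: the function names `random_undersample` and `imbalance_metrics`, the 90/10 toy labels, and the choice of F1 (beta = 1) as the F-measure are assumptions made for this illustration. AUC is left out because it needs classifier scores rather than hard labels.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(ids) for ids in by_class.values())
    kept = []
    for ids in by_class.values():
        # Sample without replacement; the minority class is kept whole.
        kept.extend(rng.sample(ids, n_min))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]


def imbalance_metrics(y_true, y_pred, positive=1):
    """Balanced accuracy, geometric mean, and F1 from hard binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": f1,
    }


# Toy data (hypothetical): 10 "epileptogenic" (1) vs 90 "non-epileptogenic" (0).
X = [[float(i)] for i in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]

X_res, y_res = random_undersample(X, y, seed=42)
class_counts = Counter(y_res)

# A classifier that always predicts the majority class is 90% accurate here,
# yet the imbalance-aware metrics expose that it never finds the minority class.
baseline = imbalance_metrics(y, [0] * len(y))
```

On this toy set the undersampled data contain 10 samples of each class, and the always-negative baseline scores a balanced accuracy of 0.5 with a geometric mean and F-measure of 0, which is why such metrics are preferred over plain accuracy in the imbalanced setting.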

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f6/8641296/02e17793e68b/fninf-15-715421-g0001.jpg
