Efficient Selection of Gaussian Kernel SVM Parameters for Imbalanced Data.

Affiliation

Division of Biometry, Department of Agronomy, National Taiwan University, Taipei 106216, Taiwan.

Publication information

Genes (Basel). 2023 Feb 25;14(3):583. doi: 10.3390/genes14030583.

Abstract

In medical data mining, class prediction models are widely used to handle various data classification problems. Classification models for high-dimensional gene expression datasets in particular have attracted many researchers, who aim to identify marker genes that distinguish cancer cells from their corresponding normal cells. However, skewed class distributions often occur in medical datasets, in which at least one of the classes has a relatively small number of observations. A classifier induced from such an imbalanced dataset typically has high accuracy for the majority class and poor prediction for the minority class. In this study, we focus on an SVM classifier with a Gaussian radial basis kernel for a binary classification problem. To take advantage of the SVM and achieve the best generalization ability, we address two important problems: class imbalance and parameter selection during SVM parameter optimization. First, we propose a novel adjustment method, called b-SVM, for adjusting the cutoff threshold of the SVM. Second, we propose a fast and simple approach, called Min-max gamma selection, to optimize the model parameters of SVMs without carrying out an extensive k-fold cross-validation. An extensive comparison with a standard SVM and well-known existing methods is carried out to evaluate the performance of our proposed algorithms using simulated and real datasets. The experimental results show that our proposed algorithms outperform over-sampling techniques and existing SVM-based solutions. This study also shows that the proposed Min-max gamma selection is at least 10 times faster than cross-validation selection, based on the average running time on six real datasets.
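
The abstract does not spell out how b-SVM adjusts the cutoff, so the following is only a generic sketch of the underlying idea, not the authors' method. It assumes a trained SVM has already produced decision values, and it picks the cutoff that maximizes the G-mean (geometric mean of sensitivity and specificity) instead of using the default cutoff of 0. The function names, the G-mean criterion, and the toy scores are all illustrative assumptions.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity; labels are +1 / -1."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return math.sqrt((tp / pos) * (tn / neg))


def predict_with_cutoff(scores, cutoff):
    """Assign +1 when the decision value exceeds the cutoff, else -1."""
    return [1 if s > cutoff else -1 for s in scores]


def best_cutoff(scores, y_true):
    """Scan the default cutoff 0 and the midpoints between sorted distinct
    scores; return the candidate with the highest G-mean."""
    distinct = sorted(set(scores))
    candidates = [0.0] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(
        candidates,
        key=lambda c: g_mean(y_true, predict_with_cutoff(scores, c)),
    )


# Hypothetical decision values: 8 majority-class (-1) and 2 minority-class (+1)
# samples. One minority sample falls on the wrong side of the default cutoff.
scores = [-1.9, -1.6, -1.4, -1.2, -1.1, -0.9, -0.8, -0.7, -0.4, 0.3]
labels = [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1]

default_score = g_mean(labels, predict_with_cutoff(scores, 0.0))
cutoff = best_cutoff(scores, labels)
adjusted_score = g_mean(labels, predict_with_cutoff(scores, cutoff))
print(default_score, cutoff, adjusted_score)
```

In this toy case, shifting the cutoff below 0 recovers the misclassified minority sample without adding false positives. In practice the cutoff would be chosen on held-out data rather than on the samples it is evaluated on.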

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3174/10048125/cf2917ef8e29/genes-14-00583-g001.jpg
