LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data.

Affiliation

Department of Natural Science and Engineering, Kanazawa University, Ishikawa 9200941, Japan.

Publication information

BioData Min. 2013 Oct 2;6(1):16. doi: 10.1186/1756-0381-6-16.

Abstract

BACKGROUND

Over-sampling methods based on the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems on imbalanced biomedical data. However, the existing over-sampling methods achieve only slightly better, and sometimes worse, results than the simplest SMOTE. To improve the effectiveness of SMOTE, this paper presents a novel over-sampling method that uses codebooks obtained by learning vector quantization. In general, even when an existing SMOTE is applied to a biomedical dataset, the empty feature space remains so large that most classification algorithms do not perform well at estimating the borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples that occupy more of the feature space than those of other SMOTE algorithms. In brief, our over-sampling method generates useful synthetic samples by referring to actual samples taken from real-world datasets.
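For orientation, the "simplest SMOTE" that serves as the baseline here can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. This is a minimal illustration of that baseline only, not of the proposed LVQ-based method, whose codebook step is not detailed in this abstract. The function name `smote`, its parameters, and the toy data are assumptions made for the example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Baseline SMOTE sketch (hypothetical helper, not the paper's method).

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (illustrative values only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the baseline can only fill the segments between close neighbours, which is one way to read the abstract's remark that much of the feature space stays empty.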

RESULTS

Experiments on eight real-world imbalanced datasets demonstrate that our proposed over-sampling method performs better than the simplest SMOTE on four of five standard classification algorithms. Moreover, the performance of our method increases when the latest SMOTE variant, MWMOTE, is used in our algorithm. Experiments on datasets for β-turn type prediction show some important patterns that have not been seen in previous analyses.

CONCLUSIONS

The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. In addition, the proposed method is largely compatible with basic classification algorithms and with existing over-sampling methods.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f4b/4016036/a6971bd0ff10/1756-0381-6-16-1.jpg
