LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data.

Affiliation

Department of Natural Science and Engineering, Kanazawa University, Ishikawa 9200941, Japan.

Publication information

BioData Min. 2013 Oct 2;6(1):16. doi: 10.1186/1756-0381-6-16.

DOI:10.1186/1756-0381-6-16

PMID:24088532

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4016036/

Abstract

BACKGROUND

Over-sampling methods based on the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems on imbalanced biomedical data. However, the existing over-sampling methods achieve results that are only slightly better, and sometimes worse, than those of the simplest SMOTE. To improve the effectiveness of SMOTE, this paper presents a novel over-sampling method that uses codebooks obtained by learning vector quantization. In general, even when an existing SMOTE is applied to a biomedical dataset, the empty feature space remains so large that most classification algorithms perform poorly at estimating the borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples that occupy more of the feature space than those of the other SMOTE algorithms. In brief, our method generates useful synthetic samples by referring to actual samples taken from real-world datasets.
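
As background for the two building blocks named above, the following is a minimal sketch of plain SMOTE interpolation and of standard LVQ1 codebook training. It is not the LVQ-SMOTE procedure itself, which the abstract does not spell out. The function names, the parameters `k`, `lr`, and `epochs`, and the linearly decaying learning rate are illustrative assumptions.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def smote(minority, n_new, k=5, rng=None):
    """Plain SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours. Needs at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = k_nearest(base, [p for p in minority if p is not base], k)
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def lvq1(samples, labels, codebook, codebook_labels, lr=0.1, epochs=20):
    """Standard LVQ1: move the nearest codebook vector toward a sample of the
    same class and away from a sample of a different class."""
    codebook = [list(c) for c in codebook]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # assumed linear decay schedule
        for x, y in zip(samples, labels):
            i = min(range(len(codebook)), key=lambda j: math.dist(x, codebook[j]))
            sign = 1.0 if codebook_labels[i] == y else -1.0
            codebook[i] = [c + sign * rate * (xi - c) for c, xi in zip(codebook[i], x)]
    return codebook
```

Because plain SMOTE only interpolates along segments between neighbouring minority samples, its synthetic points stay within the region those samples already span, which is one way to read the "empty feature space" concern raised above.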

RESULTS

Experiments on eight real-world imbalanced datasets demonstrate that the proposed over-sampling method performs better than the simplest SMOTE with four of five standard classification algorithms. Moreover, the performance of our method increases when MWMOTE, the latest SMOTE variant, is used within our algorithm. Experiments on datasets for β-turn type prediction show some important patterns that have not been seen in previous analyses.

CONCLUSIONS

The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. In addition, it is largely compatible with basic classification algorithms and with the existing over-sampling methods.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f4b/4016036/a6971bd0ff10/1756-0381-6-16-1.jpg
