A SMOTE PCA HDBSCAN approach for enhancing water quality classification in imbalanced datasets.

Author information

Nasaruddin Norashikin, Masseran Nurulkamal, Idris Wan Mohd Razi, Ul-Saufie Ahmad Zia

Affiliations

Department of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia.

School of Mathematical Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi Mara (UiTM) Kedah Branch, 08400, Merbok, Kedah, Malaysia.

Publication information

Sci Rep. 2025 Apr 16;15(1):13059. doi: 10.1038/s41598-025-97248-0.

Abstract

Class imbalance poses a significant challenge in water quality classification, often leading to biased predictions and diminished accuracy for minority classes. This study introduces SMOTE-PCA-HDBSCAN, a novel oversampling framework that integrates the Synthetic Minority Oversampling Technique (SMOTE) to generate synthetic samples, Principal Component Analysis (PCA) to enhance data separability, and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to remove synthetic data noise. The cleaned synthetic data is then merged with the original dataset to form a balanced, noise-reduced training set. Comparative evaluations against SMOTE, SMOTE-DBSCAN, SMOTE-PCA-DBSCAN, SMOTE-ENN, and SMOTE-Tomek Links reveal that SMOTE-PCA-HDBSCAN consistently improves sensitivity for minority classes (Clean: 4.76% to 28.57%; Polluted: 38.09% to 61.90%) while maintaining high accuracy for the majority class. These results demonstrate the robustness of SMOTE-PCA-HDBSCAN in addressing class imbalance, offering a valuable tool for enhancing predictive models in environmental monitoring and other domains with imbalanced datasets.
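The SMOTE step named in the abstract can be illustrated with a minimal, standard-library sketch. It covers only the interpolation idea behind SMOTE (a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours) and the final merge. The PCA projection and HDBSCAN noise filtering are not shown. The function name `smote`, its parameters, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Basic SMOTE sketch: interpolate between a minority sample and one of
    its k nearest minority neighbours. Hypothetical helper, not the authors' code."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy two-feature data (made-up values, not from the study).
majority = [(float(i), float(i % 3)) for i in range(12)]
minority = [(0.0, 10.0), (1.0, 11.0), (2.0, 10.5), (1.5, 12.0)]

# Generate enough synthetic points to match the majority class size,
# then merge them with the original minority samples.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # 12 12
```

In the framework the abstract describes, the synthetic points would additionally be screened for noise (via PCA followed by HDBSCAN) before the merge. A practical implementation would more likely rely on an established library than on a hand-written loop like this one.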

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/19a8/12003838/a85d7fa4e18e/41598_2025_97248_Fig1_HTML.jpg