A SMOTE PCA HDBSCAN approach for enhancing water quality classification in imbalanced datasets.

Authors

Nasaruddin Norashikin, Masseran Nurulkamal, Idris Wan Mohd Razi, Ul-Saufie Ahmad Zia

Affiliations

Department of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia.

School of Mathematical Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi Mara (UiTM) Kedah Branch, 08400, Merbok, Kedah, Malaysia.

Publication information

Sci Rep. 2025 Apr 16;15(1):13059. doi: 10.1038/s41598-025-97248-0.

DOI: 10.1038/s41598-025-97248-0
PMID: 40240488
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12003838/
Abstract

Class imbalance poses a significant challenge in water quality classification, often leading to biased predictions and diminished accuracy for minority classes. This study introduces SMOTE-PCA-HDBSCAN, a novel oversampling framework that integrates the Synthetic Minority Oversampling Technique (SMOTE) to generate synthetic samples, Principal Component Analysis (PCA) to enhance data separability, and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to remove synthetic data noise. The cleaned synthetic data is then merged with the original dataset to form a balanced, noise-reduced training set. Comparative evaluations against SMOTE, SMOTE-DBSCAN, SMOTE-PCA-DBSCAN, SMOTE-ENN, and SMOTE-Tomek Links reveal that SMOTE-PCA-HDBSCAN consistently improves sensitivity for minority classes (Clean: 4.76% to 28.57%; Polluted: 38.09% to 61.90%) while maintaining high accuracy for the majority class. These results demonstrate the robustness of SMOTE-PCA-HDBSCAN in addressing class imbalance, offering a valuable tool for enhancing predictive models in environmental monitoring and other domains with imbalanced datasets.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/19a8/12003838/a85d7fa4e18e/41598_2025_97248_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/19a8/12003838/dfdb239c7a02/41598_2025_97248_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/19a8/12003838/24e78473ef51/41598_2025_97248_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/19a8/12003838/c41f089db880/41598_2025_97248_Fig4_HTML.jpg

Similar articles

1
A SMOTE PCA HDBSCAN approach for enhancing water quality classification in imbalanced datasets.
Sci Rep. 2025 Apr 16;15(1):13059. doi: 10.1038/s41598-025-97248-0.
2
Addressing imbalanced data classification with Cluster-Based Reduced Noise SMOTE.
PLoS One. 2025 Feb 10;20(2):e0317396. doi: 10.1371/journal.pone.0317396. eCollection 2025.
3
Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.
Technol Health Care. 2025 Mar;33(2):1000-1013. doi: 10.1177/09287329241295874. Epub 2024 Nov 25.
4
A Synthetic Minority Oversampling Technique Based on Gaussian Mixture Model Filtering for Imbalanced Data Classification.
IEEE Trans Neural Netw Learn Syst. 2024 Mar;35(3):3740-3753. doi: 10.1109/TNNLS.2022.3197156. Epub 2024 Feb 29.
5
Enhancing and improving the performance of imbalanced class data using novel GBO and SSG: A comparative analysis.
Neural Netw. 2024 May;173:106157. doi: 10.1016/j.neunet.2024.106157. Epub 2024 Feb 2.
6
Outlier-SMOTE: A refined oversampling technique for improved detection of COVID-19.
Intell Based Med. 2020 Dec;3:100023. doi: 10.1016/j.ibmed.2020.100023. Epub 2020 Dec 3.
7
DBCSMOTE: a clustering-based oversampling technique for data-imbalanced warfarin dose prediction.
BMC Med Genomics. 2020 Oct 22;13(Suppl 10):152. doi: 10.1186/s12920-020-00781-2.
8
Comparing Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition.
Sensors (Basel). 2022 Feb 11;22(4):1373. doi: 10.3390/s22041373.
9
Data Augmentation and Machine Learning algorithms for multi-class imbalanced morphometrics data of stingless bees.
Heliyon. 2025 Jan 23;11(3):e42214. doi: 10.1016/j.heliyon.2025.e42214. eCollection 2025 Feb 15.
10
SMOTE for high-dimensional class-imbalanced data.
BMC Bioinformatics. 2013 Mar 22;14:106. doi: 10.1186/1471-2105-14-106.
