
A Machine Learning-Based Water Potability Prediction Model by Using Synthetic Minority Oversampling Technique and Explainable AI.

Affiliations

Department of Computer Science and Engineering Pandit Deendayal Energy University, Gandhinagar, Gujarat, India.

College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Saudi Arabia.

Publication Information

Comput Intell Neurosci. 2022 Sep 20;2022:9283293. doi: 10.1155/2022/9283293. eCollection 2022.

Abstract

During the last few decades, water quality has deteriorated significantly due to pollution and many other issues. As a consequence, there is a need for a model that can make accurate predictions about water quality. This work presents a comparative analysis of different machine learning approaches, namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest, Gradient Boost, and AdaBoost, applied to water quality classification. The model is trained on the Water Quality Index dataset available on Kaggle. The Z-score is used to normalize the dataset before training begins. Because the given dataset is imbalanced, the Synthetic Minority Oversampling Technique (SMOTE) is used to balance it. Experimental results show that Random Forest and Gradient Boost achieve the highest accuracy of 81%. One of the major issues with machine learning models is a lack of transparency, which makes it difficult to evaluate their results. To address this issue, explainable AI (XAI) is used, which helps determine which features are the most important. Within the context of this investigation, Local Interpretable Model-agnostic Explanations (LIME) is utilized to ascertain the significance of the features.
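The preprocessing pipeline described in the abstract, Z-score normalization followed by SMOTE oversampling of the minority class, can be sketched as follows. This is a minimal NumPy illustration on toy data under stated assumptions, not the authors' implementation: the `zscore` and `smote` functions, the neighbour count `k`, and the toy class sizes are all hypothetical choices made for the example.

```python
import numpy as np

def zscore(X):
    # Z-score normalization: each feature gets zero mean, unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)

def smote(X_min, n_new, k=5, seed=None):
    """Simplified SMOTE: create n_new synthetic minority samples by
    interpolating between a random minority sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)    # pick a base sample for each synthetic point
    neigh = nn[base, rng.integers(0, k, size=n_new)]  # pick one of its neighbours
    gap = rng.random((n_new, 1))             # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy imbalanced data: 20 "not potable" vs 5 "potable" samples, 3 features
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, (20, 3))
X_minor = rng.normal(2.0, 1.0, (5, 3))

X_synth = smote(X_minor, n_new=15, k=3, seed=1)          # balance the classes
X = zscore(np.vstack([X_major, X_minor, X_synth]))       # then normalize
y = np.array([0] * 20 + [1] * 20)
print(X.shape, np.bincount(y))  # → (40, 3) [20 20]
```

After oversampling, both classes contribute 20 samples, so a downstream classifier such as Random Forest no longer sees the skewed class prior; the same idea is what the `imblearn` SMOTE implementation provides in production code.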


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/593a/9514946/69ced492df3d/CIN2022-9283293.001.jpg
