Department of Earth and Environmental Studies, Montclair State University, New Jersey, USA; The Center for Artificial Intelligence and Environmental Sustainability (CAIES) Foundation, Patna, Bihar, India.
Department of Earth and Environmental Studies, Montclair State University, New Jersey, USA.
Ecotoxicol Environ Saf. 2022 Mar 1;232:113271. doi: 10.1016/j.ecoenv.2022.113271. Epub 2022 Feb 1.
This study evaluates state-of-the-art machine learning (ML) models for predicting the most sustainable arsenic mitigation preference. A Gaussian distribution-based Naïve Bayes (NB) classifier achieved the highest Area Under the Curve (AUC) of the Receiver Operating Characteristic curve (0.82), followed by Nu Support Vector Classification (0.80) and K-Neighbors (0.79). Ensemble classifiers scored an AUC above 0.70, with Random Forest the top performer among them (0.77); the Decision Tree model ranked fourth, also with an AUC of 0.77. The multilayer perceptron model also achieved high performance (AUC=0.75). Most linear classifiers underperformed, ranging from the Ridge classifier at the top (AUC=0.73) to the perceptron at the bottom (AUC=0.57). A Bernoulli distribution-based Naïve Bayes classifier was the poorest model (AUC=0.50). The Gaussian NB was also the most robust ML model, with the smallest variation in Kappa score between training (0.58) and test data (0.64). The results suggest that nonlinear or ensemble classifiers could more accurately capture the complex relationships in socio-environmental data and help develop accurate and robust prediction models of sustainable arsenic mitigation. Furthermore, Gaussian NB is the best option when data are scarce.
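For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how AUC and Cohen's Kappa can be computed, and of how a train/test Kappa gap can serve as a simple robustness indicator. It is not the authors' code: the function names, the toy labels, and the toy scores are hypothetical, and binary labels are assumed for the AUC calculation. The classifier names in the abstract resemble those in scikit-learn, which provides `roc_auc_score` and `cohen_kappa_score` for the same purpose, but the abstract does not state which implementation was used.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count correctly ordered positive/negative pairs; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels and predictions beyond chance."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of samples where prediction equals label.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if labels and predictions were independent.
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when only one class is present")
    return (observed - expected) / (1.0 - expected)


def kappa_gap(train_kappa, test_kappa):
    """Absolute train/test difference; a smaller gap suggests a more stable model."""
    return abs(train_kappa - test_kappa)


# Hypothetical toy data, for illustration only (not from the study).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

print(round(roc_auc(toy_labels, toy_scores), 3))     # 0.889
print(round(cohen_kappa(toy_labels, toy_preds), 3))  # 0.333
# Kappa values reported in the abstract for Gaussian NB (training, test).
print(round(kappa_gap(0.58, 0.64), 2))               # 0.06
```

An AUC of 0.50, as reported for the Bernoulli NB classifier, corresponds in this pairwise view to ranking positives above negatives no better than chance.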