Key Laboratory of Groundwater Resources and Environment, Ministry of Education, Jilin University, Changchun 130021, China; Jilin Provincial Key Laboratory of Water Resources and Environment, Jilin University, Changchun 130021, China; College of New Energy and Environment, Jilin University, Changchun 130021, China; National-Local Joint Engineering Laboratory of In-Situ Conversion, Drilling and Exploitation Technology for Oil Shale, Changchun 130021, China.
Water Res. 2024 Dec 1;267:122498. doi: 10.1016/j.watres.2024.122498. Epub 2024 Sep 21.
The increasing pollution of aquifers by human activities over recent decades threatens drinking water safety. Gaussian Process Regression (GPR) is a robust tool for predicting and monitoring water quality, but its effectiveness is hindered by the limited data available for model training and validation, known as the "small sample problem". Virtual sample generation (VSG) is one of several approaches proposed to address this problem. This study aimed to increase the accuracy of GPR for predicting water quality when datasets are limited. Three VSG methods, namely Multi-Distribution Mega-Trend Diffusion (MD-MTD), Generative Adversarial Network (GAN), and t-distributed stochastic neighbor embedding (t-SNE), were compared for their ability to enhance the accuracy of GPR predictions of strontium (Sr). The models were used to predict Sr in the shallow aquifer system in Songyuan, Jilin Province. t-SNE provided the largest improvement in GPR accuracy, with R increasing from 0.86 to 0.99 (12.98 %), followed by MD-MTD (R of 0.95, 9.39 %); GAN gave the smallest improvement (R of 0.92, 5.98 %). Boxplots show that MD-MTD-GPR predictions do not fully capture the observed data distribution, whereas GAN accurately replicates the data distribution and t-SNE-GPR achieves the highest prediction accuracy and handles data fluctuations. GPR accuracy improves as the number of virtual samples increases but tends to decrease once the number exceeds 258 in this study. These findings can guide efforts to improve GPR accuracy when datasets are limited and can help improve water quality management and drinking water safety in regions with sparse monitoring data.
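To make the idea of virtual sample generation concrete, the following is a minimal sketch of basic mega-trend diffusion (MTD) for a single attribute. It is not the authors' implementation and it is not the full MD-MTD variant compared in the study. The Sr concentrations are hypothetical, the function names (`mtd_bounds`, `triangular_membership`, `generate_virtual`) are illustrative, and accepting candidates in proportion to a triangular membership value is an assumption about how the diffused range is sampled.

```python
import math
import random
import statistics


def mtd_bounds(values):
    """Estimate diffused lower/upper bounds of one attribute (basic MTD)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    center = (lo + hi) / 2.0
    # Counts on each side of the center drive the skewness weights.
    n_low = sum(1 for v in values if v < center)
    n_high = sum(1 for v in values if v > center)
    skew_low = n_low / (n_low + n_high)
    skew_high = n_high / (n_low + n_high)
    # Diffusion width from the sample variance; 1e-20 is the usual MTD constant.
    spread = -2.0 * statistics.variance(values) * math.log(1e-20)
    lower = center - skew_low * math.sqrt(spread / n_low)
    upper = center + skew_high * math.sqrt(spread / n_high)
    # The diffused range must never be narrower than the observed range.
    return min(lower, lo), center, max(upper, hi)


def triangular_membership(x, lower, center, upper):
    """Plausibility of x: 1 at the center, falling to 0 at either bound."""
    if x <= lower or x >= upper:
        return 0.0
    if x <= center:
        return (x - lower) / (center - lower)
    return (upper - x) / (upper - center)


def generate_virtual(values, n_virtual, rng):
    """Draw virtual values, keeping each with probability equal to its membership."""
    lower, center, upper = mtd_bounds(values)
    virtual = []
    while len(virtual) < n_virtual:
        candidate = rng.uniform(lower, upper)
        if rng.random() < triangular_membership(candidate, lower, center, upper):
            virtual.append(candidate)
    return virtual


# Hypothetical Sr concentrations (mg/L), for illustration only.
sr_mg_per_l = [0.42, 0.55, 0.61, 0.48, 0.95, 0.52, 0.70]
rng = random.Random(0)
virtual = generate_virtual(sr_mg_per_l, 20, rng)
augmented = sr_mg_per_l + virtual  # observed plus virtual samples
```

In a workflow like the one the abstract describes, an augmented set of this kind would be used to train the GPR model, with the number of virtual samples treated as a tuning parameter, since the study reports that accuracy declines beyond a certain count.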