Song Yang, Shen Chunqi, Hong Yi
Cooperative Institute for Great Lakes Research, School for Environment and Sustainability, University of Michigan, Ann Arbor, MI, 48109, United States.
Yale School of Environment, Yale University, New Haven, CT, 06511, United States.
J Environ Manage. 2025 Apr;380:125007. doi: 10.1016/j.jenvman.2025.125007. Epub 2025 Mar 17.
Algal blooms, which have substantial adverse effects, are occurring increasingly often worldwide under global warming and eutrophication. Machine learning models (MLMs) are emerging as efficient and promising tools for predicting algal blooms. However, the performance of MLMs in directly simulating algal blooms has seldom been reported, particularly in the world's largest freshwater system, the Great Lakes. To address this gap, we compared how well 10 popular MLMs predict chlorophyll a (Chl a, a proxy for algal biomass) concentration in western Lake Erie, using 15 measured water quality parameters collected from 2012 to 2022. The results show that outlier removal is essential: it noticeably improves prediction accuracy, for example raising the coefficient of determination (R²) from 0.35 to 0.84 (a 140% increase) for the best-performing model, Gradient Boosting Decision Trees (GBDT). All 32,767 feature combinations of the measured water quality parameters were exhaustively tested for each MLM, and the best feature combination for each was identified. MLMs benefit from this feature selection, with the Polynomial Regression model showing a notable improvement: its R² increased from 0.71 without feature selection to 0.82 with it (a 15% increase). The tree-based ensemble models, GBDT (R² = 0.84) and Random Forest (R² = 0.82), were the two best performers in predicting Chl a. Feature importance analysis identified particulate organic nitrogen (PON) as the most critical water quality parameter for predicting Chl a. These results establish a benchmark for the performance of common MLMs in predicting Chl a in western Lake Erie. The best feature combinations identified could make water quality observations more effective and targeted, thereby benefiting sustainable water quality management.
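The arithmetic behind two figures in the abstract can be checked with a short sketch. It assumes that the 32,767 combinations are the non-empty subsets of the 15 parameters (2^15 - 1) and that the reported percentages are relative increases. The parameter names, the helper functions, and the generic `score_fn` argument are hypothetical placeholders for illustration, not the authors' code or data.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty combination of the given features."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def best_subset(features, score_fn):
    """Exhaustively score every non-empty subset and return (subset, score).

    score_fn is a caller-supplied function (hypothetical here); in practice
    it might fit a model on the subset and return a validation score.
    """
    scored = ((subset, score_fn(subset)) for subset in nonempty_subsets(features))
    return max(scored, key=lambda pair: pair[1])


def relative_gain_percent(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Placeholder names: the abstract does not list the 15 parameters.
features = [f"param_{i:02d}" for i in range(1, 16)]
n_subsets = sum(1 for _ in nonempty_subsets(features))

print(n_subsets)
print(round(relative_gain_percent(0.35, 0.84)))  # outlier removal, GBDT
print(round(relative_gain_percent(0.71, 0.82)))  # feature selection, Polynomial Regression
```

An exhaustive search of this kind grows as 2^n, so it stays tractable for 15 parameters but would quickly become impractical for much larger feature sets.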