de Souza Valter Cesar, Rodrigues Sergio Augusto, Filho Luís Roberto Almeida Gabriel
São Paulo State University (Unesp), School of Agriculture, Botucatu, São Paulo, Brazil.
São Paulo State University (Unesp), School of Sciences and Engineering, Tupã, São Paulo, Brazil.
PLoS One. 2024 Dec 31;19(12):e0315574. doi: 10.1371/journal.pone.0315574. eCollection 2024.
Precise, high-quality, and reliable meteorological data are crucial in many areas of agronomy, especially in studies of reference evapotranspiration (ETo). ETo plays a fundamental role in the hydrological cycle, irrigation system planning and management, water demand modeling, water stress monitoring, water balance estimation, and hydrological and environmental studies. However, time series records often contain missing measurements. The aim of this study was to evaluate the performance of alternative multivariate procedures based on principal component analysis (PCA), using the Nonlinear Iterative Partial Least Squares (NIPALS) and Expectation-Maximization (EM) algorithms, for imputing missing data in time series of meteorological variables. The evaluation was carried out on high-dimensional, reduced-sample databases with different percentages of missing data. The databases, collected between 2011 and 2021, originated from 45 automatic weather stations in the São Paulo region, Brazil, and were used to create a daily time series of ETo. Five scenarios of missing data (10%, 20%, 30%, 40%, 50%) were simulated, in which data were randomly removed from the ETo database. Imputation was then performed using the NIPALS-PCA, EM-PCA, and simple mean imputation (IM) procedures. This cycle was repeated 100 times, and average performance indicators were calculated. Statistical performance was evaluated with the following indicators: correlation coefficient (r), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Square Error (MSE), Normalized Root Mean Square Error (nRMSE), Willmott Index (d), and performance index (c). In the scenario with 10% missing data, NIPALS-PCA achieved the lowest MAPE (15.4%), followed by EM-PCA (17.0%), while IM recorded a MAPE of 24.7%. In the scenario with 50% missing data, the ranking reversed, with EM-PCA showing the lowest MAPE (19.1%), followed by NIPALS-PCA (19.9%). The NIPALS-PCA and EM-PCA approaches demonstrated good imputation results (10% ≤ nRMSE < 20%), with NIPALS-PCA performing best in the 10%, 20%, and 30% scenarios, and EM-PCA in the 40% and 50% scenarios. Based on the statistical evaluation, the NIPALS-PCA, EM-PCA, and IM imputation models proved suitable for estimating missing ETo data, with the PCA-based models using the NIPALS and EM algorithms showing the most promise. Future research should explore the effectiveness of various imputation methods in diverse climatic and geographical contexts, and develop new techniques that consider the temporal and spatial structure of meteorological data, to advance climate understanding and prediction.
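
The evaluation cycle described in the abstract (remove data at random, impute, score against the withheld values) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic (a hypothetical 365-day by 45-station matrix), the function names are invented for illustration, `em_pca_impute` is a generic iterative (EM-style) PCA imputation built on NumPy's SVD, NIPALS is not shown, and the nRMSE is normalized by the observed mean, which is an assumption because the abstract does not state the normalization.

```python
import numpy as np

def em_pca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    """Generic EM-style PCA imputation (a sketch, not the paper's code):
    fill gaps, fit a low-rank PCA model, refill the gaps, and repeat."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from simple mean imputation of each column (station).
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        k = n_components
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu  # rank-k reconstruction
        new = np.where(missing, recon, X)          # observed values stay fixed
        change = np.max(np.abs(new - filled))
        filled = new
        if change < tol:
            break
    return filled

def mean_impute(X):
    """Simple mean imputation (IM): replace gaps with the column mean."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

def mape(obs, est):
    return 100.0 * np.mean(np.abs((obs - est) / obs))

def nrmse(obs, est):
    # Assumption: RMSE normalized by the mean of the observed values.
    return 100.0 * np.sqrt(np.mean((obs - est) ** 2)) / np.mean(obs)

def willmott_d(obs, est):
    om = np.mean(obs)
    num = np.sum((est - obs) ** 2)
    den = np.sum((np.abs(est - om) + np.abs(obs - om)) ** 2)
    return 1.0 - num / den

# Synthetic stand-in for daily ETo (mm/day): days x stations, hypothetical values.
rng = np.random.default_rng(0)
days, stations = 365, 45
t = np.arange(days)
shared = 4.0 + 1.5 * np.sin(2 * np.pi * t / 365.0) + rng.normal(0, 0.4, days)
scale = rng.uniform(0.8, 1.2, stations)
truth = np.outer(shared, scale) + rng.normal(0, 0.2, (days, stations))
truth = np.clip(truth, 0.5, None)  # keep values positive so MAPE is defined

# One scenario: withhold 10% of the entries at random, then impute.
mask = rng.random(truth.shape) < 0.10
X = truth.copy()
X[mask] = np.nan

X_em = em_pca_impute(X)
X_mean = mean_impute(X)

for name, est in [("EM-PCA", X_em), ("IM", X_mean)]:
    o, e = truth[mask], est[mask]
    print(f"{name}: MAPE={mape(o, e):.1f}%  nRMSE={nrmse(o, e):.1f}%  d={willmott_d(o, e):.3f}")
```

In the study this cycle was repeated 100 times per missing-data percentage and the indicators were averaged; wrapping the masking, imputation, and scoring steps in a loop over seeds and percentages would mirror that design. The numbers printed by this sketch depend entirely on the synthetic data and are not comparable to the results reported above.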