Unit of Epidemiology and Medical Statistics, Department of Diagnostics and Public Health, University of Verona, Verona, Italy.
School of Aerospace Engineering, Universidad Politécnica de Madrid, Madrid, Spain.
PLoS One. 2024 Nov 19;19(11):e0314005. doi: 10.1371/journal.pone.0314005. eCollection 2024.
Few studies have assessed the accuracy of imputation methods in aerobiological datasets. We conducted a simulation study to evaluate, for the first time, the effectiveness of Gappy Singular Value Decomposition (GSVD), a data-driven approach, and compared it with moving mean interpolation, a statistical approach. Using complete pollen data for 2022 from two monitoring stations in northeastern Italy, we randomly generated missing data by combining different proportions of missing values (5%, 10%, 25%) with different gap lengths (3, 5, 7, 10 days). We imputed 4800 time series using a GSVD algorithm implemented specifically for this study and the moving mean algorithm of the "AeRobiology" R package. We assessed imputation accuracy with the Root Mean Square Error (RMSE) and used multiple linear regression models to identify factors independently affecting the error (e.g. pollen variability, simulation settings). GSVD performed as well as the well-established moving mean method and showed strong generalization across different data types. However, the imputation error was influenced primarily by pollen characteristics and location, regardless of the imputation method used. High variability in pollen concentrations and the distribution of missing data both reduced imputation accuracy. In conclusion, we introduced and tested a novel imputation method whose performance in reconstructing aerobiological data was comparable to that of the statistical approach. These findings contribute to advancing aerobiological data analysis and highlight the need to improve imputation methods.
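The simulation design described above (generate gaps, impute, score with RMSE) can be illustrated with a minimal Python sketch. This is not the authors' GSVD implementation, nor the moving mean algorithm of the "AeRobiology" R package. The function names (`make_gaps`, `moving_mean_impute`, `gappy_svd_impute`, `rmse`), the `rank` and `window` defaults, and the iterative low-rank fill used here as a stand-in for a gappy SVD are all assumptions made for illustration.

```python
import numpy as np


def make_gaps(n_days, proportion, gap_len, rng):
    """Boolean mask (True = missing) built from randomly placed blocks of
    gap_len consecutive days until roughly `proportion` of days are missing.
    Blocks may overlap, so the final share is only approximate."""
    mask = np.zeros(n_days, dtype=bool)
    target = int(round(proportion * n_days))
    attempts = 0
    while mask.sum() < target and attempts < 10_000:
        start = rng.integers(0, n_days - gap_len + 1)
        mask[start:start + gap_len] = True
        attempts += 1
    return mask


def moving_mean_impute(x, mask, window=5):
    """Fill each missing day with the mean of the observed values in a
    centred window; the window widens if it contains no observed value.
    (Hypothetical simplification, not the AeRobiology algorithm.)"""
    if mask.all():
        raise ValueError("series has no observed values")
    x = x.astype(float)
    out = x.copy()
    n = len(x)
    for i in np.flatnonzero(mask):
        half = window // 2
        while True:
            lo, hi = max(0, i - half), min(n, i + half + 1)
            observed = ~mask[lo:hi]
            if observed.any():
                out[i] = x[lo:hi][observed].mean()
                break
            half += 1
    return out


def gappy_svd_impute(X, mask, rank=2, n_iter=200, tol=1e-8):
    """Iterative low-rank fill for a (days x series) matrix: start from
    column means, then repeatedly replace the missing entries with their
    rank-`rank` SVD reconstruction. Observed entries are never changed."""
    X = X.astype(float)
    n_obs = np.maximum((~mask).sum(axis=0), 1)
    col_means = np.where(mask, 0.0, X).sum(axis=0) / n_obs
    filled = np.where(mask, col_means, X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        new = np.where(mask, approx, X)
        change = np.linalg.norm(new - filled)
        filled = new
        if change <= tol * max(np.linalg.norm(filled), 1.0):
            break
    return filled


def rmse(truth, imputed, mask):
    """Root Mean Square Error computed on the artificially removed entries."""
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

In a simulation of this kind, `make_gaps` would be called once per combination of missing proportion and gap length, each imputation would be scored with `rmse` against the complete series, and the resulting errors could then be regressed on the simulation settings.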