Casella Monica, Milano Nicola, Dolce Pasquale, Marocco Davide
Natural and Artificial Cognition Laboratory, Department of Humanistic Studies, University of Naples "Federico II", Naples, Italy.
Department of Translational Medical Science, University of Naples "Federico II", Naples, Italy.
Front Psychol. 2024 Dec 17;15:1449272. doi: 10.3389/fpsyg.2024.1449272. eCollection 2024.
Missing data in psychometric research presents a substantial challenge, affecting the reliability and validity of study outcomes. Several factors contribute to this issue, including participant non-response, dropout, and technical errors during data collection. Traditional methods commonly used to handle missing data, such as mean imputation or regression imputation, rely on assumptions that may not hold for psychological data and can lead to distorted results.
This study aims to evaluate the effectiveness of transformer-based deep learning for missing data imputation, comparing ReMasker, a masking autoencoding transformer model, with conventional imputation techniques (mean and median imputation, Expectation-Maximization algorithm) and machine learning approaches (K-nearest neighbors, MissForest, and an Artificial Neural Network). A psychometric dataset from the COVID distress repository was used, with imputation performance assessed through the Root Mean Squared Error (RMSE) between the original and imputed data matrices.
Results indicate that machine learning techniques, particularly ReMasker, achieve lower reconstruction error than conventional imputation techniques across all tested scenarios.
This finding underscores the potential of transformer-based models to provide robust imputation in psychometric research, enhancing data integrity and generalizability.
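The evaluation described above (hide known values, impute them, then score the reconstruction with RMSE) can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy Likert-style data, the completely-at-random masking, the choice to score only the masked cells, and the helper names `mask_at_random`, `impute_columns`, and `rmse` are all assumptions made for illustration, and only the mean and median baselines are shown.

```python
import math
import random
import statistics


def mask_at_random(matrix, missing_rate, seed=0):
    """Return a copy of `matrix` with cells replaced by None completely at random.

    Assumption: the abstract does not state the missingness mechanism.
    """
    rng = random.Random(seed)
    return [[None if rng.random() < missing_rate else v for v in row]
            for row in matrix]


def impute_columns(matrix, summary=statistics.mean):
    """Fill each None with a column summary (mean or median) of the observed values."""
    fills = []
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        fills.append(summary(observed) if observed else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def rmse(original, imputed, masked):
    """Root Mean Squared Error over the cells that were hidden in `masked`.

    Assumption: only masked cells are scored; observed cells are copied
    unchanged by the imputers above and would contribute zero error.
    """
    squared = [(o - i) ** 2
               for row_o, row_i, row_m in zip(original, imputed, masked)
               for o, i, m in zip(row_o, row_i, row_m)
               if m is None]
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0


# Hypothetical toy data: 200 respondents answering 6 items on a 1-5 scale.
data_rng = random.Random(42)
original = [[data_rng.randint(1, 5) for _ in range(6)] for _ in range(200)]

masked = mask_at_random(original, missing_rate=0.2, seed=1)
mean_imputed = impute_columns(masked, statistics.mean)
median_imputed = impute_columns(masked, statistics.median)

print("mean imputation RMSE:  ", rmse(original, mean_imputed, masked))
print("median imputation RMSE:", rmse(original, median_imputed, masked))
```

Any other imputer can be scored the same way by passing its completed matrix to `rmse` together with the same `original` and `masked` matrices, which keeps the comparison across methods on identical hidden cells.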