Khan Murad Ali
Department of Computer Engineering, Jeju National University, Jeju 63243, Jeju-do, Republic of Korea.
Bioengineering (Basel). 2024 Jul 23;11(8):740. doi: 10.3390/bioengineering11080740.
In clinical datasets, missing data often arise for various reasons, including non-response, data corruption, and errors in data collection or processing. Such missing values can lead to biased statistical analyses, reduced statistical power, and potentially misleading findings, making effective imputation critical. Traditional imputation methods, such as zero imputation, mean imputation, and k-nearest neighbors (KNN) imputation, attempt to address these gaps. However, they often fail to capture the underlying data complexity, resulting in oversimplified assumptions and prediction errors. This study introduces a novel imputation model that employs a transformer-based architecture to address these challenges. Notably, the model distinguishes between complete and incomplete EEG signal amplitude data in two datasets, PhysioNet and CHB-MIT. By training exclusively on complete amplitude data, the TabTransformer accurately learns and predicts missing values, capturing the intricate patterns and relationships inherent in EEG amplitude data. Evaluation using several error metrics and the R² score shows substantial improvements over traditional methods such as zero, mean, and KNN imputation. The proposed model achieves R² scores of 0.993 on PhysioNet and 0.97 on CHB-MIT, highlighting its efficacy in handling complex clinical data patterns and improving dataset integrity. These results underscore the potential of transformer models to advance the utility and reliability of clinical datasets.
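To make the baseline comparison concrete, the following is a minimal, self-contained sketch (not the study's code) of how the imputers mentioned in the abstract (zero, mean, and KNN imputation) can be scored with R² on artificially masked tabular amplitude data. The synthetic data, the 10% masking rate, and the KNN neighbor count are illustrative assumptions; the paper's TabTransformer model, which is trained on complete rows to predict the masked values, is not reproduced here because its architecture details are not given in the abstract.

```python
# Minimal sketch: compare Zero, Mean, and KNN imputation on synthetic,
# artificially masked EEG-like amplitude data, scoring each method with R^2
# on the masked entries only. All data and hyperparameters are assumptions
# for illustration, not values from the study.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_true = rng.normal(size=(1000, 23))       # stand-in for EEG amplitude features
mask = rng.random(X_true.shape) < 0.10     # hide ~10% of entries at random
X_missing = X_true.copy()
X_missing[mask] = np.nan

imputers = {
    "Zero": SimpleImputer(strategy="constant", fill_value=0.0),
    "Mean": SimpleImputer(strategy="mean"),
    "KNN":  KNNImputer(n_neighbors=5),
}

for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    # Evaluate only on the entries that were actually hidden.
    score = r2_score(X_true[mask], X_filled[mask])
    print(f"{name} imputation R^2 on masked entries: {score:.3f}")
```

A learned model such as the paper's TabTransformer would slot into the same loop in place of the scikit-learn imputers, predicting the masked cells from the observed ones and being scored with the identical R² protocol.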