Hui Harvard Wai Hann, Chan Wei Xin, Goh Wilson Wen Bin
Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore.
Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf168.
Batch effect associated missing values (BEAMs) are batch-wide missingness induced by the integration of data with different coverage of biomedical features. BEAMs can present substantial challenges in data analysis. This study investigates how BEAMs impact missing value imputation (MVI) and batch effect (BE) correction algorithms (BECAs). Through simulations and analyses of real-world datasets including the Clinical Proteomic Tumour Analysis Consortium (CPTAC), we evaluated six MVI methods: K-nearest neighbors (KNN), Mean, MinProb, Singular Value Decomposition (SVD), Multivariate Imputation by Chained Equations (MICE), and Random Forest (RF), with ComBat and limma as the BECAs. We demonstrated that BEAMs strongly affect MVI performance, resulting in inaccurate imputed values, inflated significant P-values, and compromised BE correction. KNN, SVD, and RF were particularly prone to propagating random signals, resulting in false statistical confidence. While imputation with Mean and MinProb was less detrimental, artifacts were nonetheless introduced. Furthermore, the detrimental effect of BEAMs increased in parallel with their severity in the data. Our findings highlight the necessity of comprehensive assessments and tailored strategies to handle BEAMs in multi-batch datasets to ensure reliable data analysis and interpretation. Future work should investigate more advanced simulations and a variety of dedicated MVI methods to robustly address BEAMs.
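To make the definition of BEAMs concrete, the following is a minimal sketch in Python (NumPy only) of how batch-wide missingness can arise and why feature-wise Mean imputation can carry a batch effect into the imputed values. It is not the simulation design used in the study: the sample and feature counts, the additive shift `batch_shift`, and the helper names `find_beams` and `mean_impute` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are assumptions, not values from the study.
n_per_batch, n_features = 20, 50
batch = np.repeat([0, 1], n_per_batch)  # batch label for each sample (row)

# Complete "ground truth": shared signal plus an additive shift in batch 1.
batch_shift = 2.0
truth = rng.normal(loc=10.0, scale=1.0, size=(2 * n_per_batch, n_features))
truth[batch == 1] += batch_shift

# Introduce BEAMs: the first 10 features are entirely unobserved in batch 1,
# as if that batch came from a platform that does not cover them.
beam_features = np.arange(10)
observed = truth.copy()
observed[np.ix_(batch == 1, beam_features)] = np.nan


def find_beams(x, batch):
    """Boolean mask (n_batches, n_features): True where a feature is
    missing for every sample of that batch."""
    return np.array(
        [np.isnan(x[batch == b]).all(axis=0) for b in np.unique(batch)]
    )


def mean_impute(x):
    """Replace each missing value with the mean of the observed values
    of the same feature, ignoring batch membership."""
    filled = x.copy()
    col_means = np.nanmean(x, axis=0)
    rows, cols = np.where(np.isnan(x))
    filled[rows, cols] = col_means[cols]
    return filled


beam_mask = find_beams(observed, batch)
imputed = mean_impute(observed)

# Error of the imputed values against the known ground truth.
missing = np.isnan(observed)
rmse = np.sqrt(np.mean((imputed[missing] - truth[missing]) ** 2))
print(f"features flagged as BEAMs in batch 1: {beam_mask[1].sum()}")
print(f"RMSE on imputed BEAM entries: {rmse:.2f}")
```

In this toy setting, the only observed values for the affected features come from batch 0, so the imputed batch 1 entries sit at the batch 0 level and miss the shift entirely. Neighbor-based or model-based imputers would need separate sketches, since they draw on other features and samples rather than a single column mean.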