Generative adversarial networks for imputing missing data for big data clinical research.

Affiliations

Department of Family Medicine and Primary Care, Faculty of Medicine, University of Hong Kong, Hong Kong, Hong Kong SAR, China.

School of Nursing, Faculty of Medicine, University of Hong Kong, Hong Kong, Hong Kong SAR, China.

Publication information

BMC Med Res Methodol. 2021 Apr 20;21(1):78. doi: 10.1186/s12874-021-01272-3.

Abstract

BACKGROUND

Missing data are a pervasive problem in clinical research. Generative adversarial imputation nets (GAIN), a novel machine learning approach to data imputation, have the potential to impute missing data accurately and efficiently but have not yet been evaluated on large empirical clinical datasets.

OBJECTIVES

This study aimed to evaluate the accuracy of GAIN in imputing missing values in large real-world clinical datasets with mixed-type variables. The computational efficiency of GAIN was also evaluated. The performance of GAIN was compared with that of two commonly used methods, MICE and missForest.

METHODS

Two real-world clinical datasets were used. The first came from a cohort study on the long-term outcomes of patients with diabetes (50,000 complete cases), and the second from a cohort study on the effectiveness of a risk assessment and management programme for patients with hypertension (10,000 complete cases). Missing values (missing at random) in the independent variables were simulated at two missingness rates (20% and 50%). Imputation accuracy was measured by the normalized root mean square error (NRMSE) between imputed and real values for continuous variables, and by the proportion of falsely classified entries (PFC) for categorical variables. Computation time per imputation was recorded for each method. Differences in accuracy between the imputation methods were compared using ANOVA or a non-parametric test.
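The two accuracy metrics can be illustrated with a minimal Python sketch. The abstract does not state how the NRMSE is normalized, so the code below assumes normalization by the (population) variance of the true values, a definition commonly used with missForest; the function names `nrmse` and `pfc` are illustrative and not taken from the study. Both functions expect only the entries that were artificially masked and then imputed.

```python
import math

def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the masked-and-imputed continuous entries.

    Assumption: the mean squared error is divided by the population
    variance of the true values before taking the square root. The
    abstract does not specify the normalization actually used.
    """
    n = len(true_vals)
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    if var_true == 0:
        raise ValueError("true values have zero variance; NRMSE is undefined")
    return math.sqrt(mse / var_true)

def pfc(true_labels, imputed_labels):
    """Proportion of masked categorical entries imputed with the wrong category."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)

# Toy values, not data from the study.
true_cont = [1.0, 2.0, 3.0, 4.0]
imputed_cont = [1.0, 2.0, 3.0, 6.0]
true_cat = ["a", "b", "a", "c"]
imputed_cat = ["a", "b", "b", "c"]

print(nrmse(true_cont, imputed_cont))
print(pfc(true_cat, imputed_cat))
```

For both metrics, 0 means perfect imputation and larger values mean worse accuracy.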

RESULTS

Both missForest and GAIN were more accurate than MICE. GAIN showed accuracy similar to that of missForest when the simulated missingness rate was 20%, but was more accurate when the simulated missingness rate was 50%. GAIN was the most accurate method for imputing skewed continuous variables and imbalanced categorical variables at both missingness rates. GAIN was also much faster (32 min on a PC) than missForest (1300 min) when the sample size was 50,000.

CONCLUSION

GAIN imputed missing data in large real-world clinical datasets more accurately than MICE and missForest, and was more robust to a high missingness rate (50%). Its high computation speed is an added advantage in big clinical data research. GAIN holds potential as an accurate and efficient method for missing data imputation in future big data clinical research.

TRIAL REGISTRATION

ClinicalTrials.gov ID: NCT03299010; Unique Protocol ID: HKUCTR-2232.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe89/8059005/34550ea6bc18/12874_2021_1272_Fig1_HTML.jpg