Ghosheh Ghadeer O, Thwaites C Louise, Zhu Tingting
Department of Engineering Sciences, University of Oxford, Oxford OX1 3PJ, UK.
Oxford University Clinical Research Unit (OUCRU), Ho Chi Minh City 710400, Vietnam.
Biomedicines. 2023 Jun 18;11(6):1749. doi: 10.3390/biomedicines11061749.
The spread of machine learning models, coupled with the growing adoption of electronic health records (EHRs), has opened the door to developing clinical decision support systems. However, despite the great promise of machine learning for healthcare in low- and middle-income countries (LMICs), many data-specific limitations, such as small dataset size and irregular sampling, hinder progress in such applications. Recently, deep generative models have been proposed to generate realistic-looking synthetic data, including EHRs, by learning the underlying data distribution without compromising patient privacy. In this study, we first use a deep generative model to generate synthetic data based on a small dataset (364 patients) from an LMIC setting. Next, we use the synthetic data to build models that predict the onset of hospital-acquired infections based on minimal information collected at patient ICU admission. The diagnostic model trained on the synthetic data outperformed models trained on the original data and on data oversampled with techniques such as SMOTE. We also vary the size of the synthetic dataset and observe the impact on the performance and interpretability of the models. Our results show the promise of deep generative models in enabling healthcare data owners to develop and validate models that serve their needs and applications, despite limitations in dataset size.
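For readers unfamiliar with the oversampling baseline named in the abstract, the following is a minimal sketch of SMOTE-style interpolation. It is not the authors' code and not the deep generative model used in the study. The function name `smote_like_oversample`, its parameters, and the toy feature vectors are hypothetical and chosen only for illustration. The idea is that each new minority-class sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by SMOTE-style interpolation.

    Hypothetical helper for illustration only: `minority` is a list of
    equal-length numeric feature vectors from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate a random fraction of the way towards that neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up feature vectors (not data from the study).
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new_rows = smote_like_oversample(minority_rows, n_new=6)
```

Because every synthetic row lies between two existing rows, this kind of oversampling cannot produce values outside the range already observed in the minority class. A generative model that learns the underlying data distribution is not restricted in this way.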