Yoo Hakje, Moon Jose, Kim Jong-Ho, Joo Hyung Joon
Korea University Research Institute for Medical Bigdata Science, Korea University College of Medicine, Seongbuk-gu, Seoul, Republic of Korea.
Department of Bio-Mechatronic Engineering, Sungkyunkwan University College of Biotechnology and Bioengineering, Jangan-gu, Suwon, Gyeonggi, Republic of Korea.
Health Inf Sci Syst. 2023 Aug 30;11(1):41. doi: 10.1007/s13755-023-00241-y. eCollection 2023 Dec.
The purpose of this study is to construct a synthetic dataset of ECG signals that overcomes the sensitivity of personal information and the complexity of disclosure policies.
The public dataset was constructed by generating synthetic data with a deep learning model that combines a convolutional neural network (CNN) and bi-directional long short-term memory (Bi-LSTM), and the effectiveness of the dataset was verified by developing classification models for ECG diagnoses.
The generated synthetic 12-lead ECG dataset consists of a total of 6000 ECGs, covering a normal group and five abnormal groups. The synthetic ECG signal has a waveform pattern similar to the original ECG signal: the average RMSE between the two signals is 0.042 µV, and the average cosine similarity is 0.993. In addition, five classification models were developed to verify the effectiveness of the synthetic dataset, and they showed performance similar to that of models built with the actual dataset. In particular, even when the real dataset was used as the test set for classification models trained on the synthetic dataset, all models showed high classification accuracy (average accuracy 93.41%).
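The two similarity measures reported above, RMSE and cosine similarity, can be illustrated with a minimal sketch. It assumes each signal is a plain list of samples from a single lead. The function names and the toy sine-wave signals are hypothetical stand-ins, not the study's code or data, and the abstract does not say how the values were averaged across leads or records.

```python
import math
from typing import Sequence


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length signals."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length signals viewed as vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero signal")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy signals: a sine wave and a slightly perturbed copy.
    real = [math.sin(2 * math.pi * t / 50) for t in range(200)]
    synthetic = [v + 0.01 * math.cos(t) for t, v in enumerate(real)]
    print("RMSE:", rmse(real, synthetic))
    print("cosine similarity:", cosine_similarity(real, synthetic))
```

A low RMSE indicates that the sample-wise amplitudes are close, while a cosine similarity near 1 indicates that the overall waveform shapes are aligned regardless of scale, so the two measures are complementary.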
The synthetic 12-lead ECG dataset was confirmed to perform similarly to real-world 12-lead ECGs in the classification models. This implies that a synthetic dataset can perform similarly to a real dataset in clinical research using AI. The synthetic dataset generation process in this study provides a way to overcome the medical data disclosure challenges imposed by privacy rights, encourages open data policies, and contributes significantly to promoting cardiovascular disease research.