Wang Peng, Li Yong, Yang Liang, Li Simin, Li Linfeng, Zhao Zehan, Long Shaopei, Wang Fei, Wang Hongqian, Li Ying, Wang Chengliang
College of Computer Science, Chongqing University, Chongqing, China.
School of Computer Science, South China Normal University, Guangzhou, China.
JMIR Med Inform. 2022 Aug 30;10(8):e38154. doi: 10.2196/38154.
With the popularization of electronic health records in China, digitalized data offer great potential for real-world medical research. However, these data usually contain a great deal of protected health information, and using them directly may cause privacy issues. The task of deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem. Rule-based, machine learning-based, and deep learning-based methods have been proposed to solve this problem. However, these methods still face two difficulties: insufficient Chinese electronic health record data and the complex features of the Chinese language.
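To make the named entity recognition framing concrete, the following is a minimal sketch of how per-token BIO tags could be turned into redacted text. The tag names (`NAME`, `DATE`), the helper names `bio_to_spans` and `redact`, and the sample sentence are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def redact(tokens: List[str], tags: List[str], sep: str = " ") -> str:
    """Replace every tagged mention with a placeholder such as [NAME]."""
    out, cursor = [], 0
    for start, end, label in bio_to_spans(tags):
        out.extend(tokens[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.extend(tokens[cursor:])
    return sep.join(out)


# Hypothetical example; labels and text are invented for illustration.
tokens = ["Patient", "Zhang", "San", "admitted", "2020-01-05"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-DATE"]
print(redact(tokens, tags))
```

For Chinese text the tokens would typically be individual characters, in which case `sep=""` joins them back without spaces.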
This paper proposes a method that addresses overfitting and the lack of training data in deep neural networks, enabling the deidentification of Chinese protected health information.
We propose a new model that combines TinyBERT, a compact variant of BERT (bidirectional encoder representations from transformers), as a text feature extraction module with the conditional random field method as a prediction module for deidentifying protected health information in Chinese medical electronic health records. In addition, we propose a hybrid data augmentation method that integrates a sentence generation strategy and a mention-replacement strategy to compensate for the shortage of Chinese electronic health record data.
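The abstract does not describe how the mention-replacement strategy is implemented. As a rough sketch of the general idea only, the code below swaps each tagged mention for another mention of the same entity type drawn from a lexicon and rebuilds the BIO tags to match the new length. The function name `replace_mentions`, the lexicon structure, and the labels are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def replace_mentions(
    tokens: List[str],
    tags: List[str],
    lexicon: Dict[str, List[List[str]]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Return a new (tokens, tags) pair with each mention swapped for a
    same-type mention sampled from `lexicon` (hypothetical structure:
    label -> list of non-empty tokenized mentions)."""
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            label = tag[2:]
            # Find the end of the current mention.
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{label}":
                j += 1
            candidates = lexicon.get(label)
            # Keep the original mention when the lexicon has no entry.
            mention = rng.choice(candidates) if candidates else tokens[i:j]
            new_tokens.extend(mention)
            new_tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(mention) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


# Hypothetical example; names are invented for illustration.
lexicon = {"NAME": [["Li", "Si", "Wu"]]}
tokens = ["Patient", "Zhang", "San", "admitted"]
tags = ["O", "B-NAME", "I-NAME", "O"]
aug_tokens, aug_tags = replace_mentions(tokens, tags, lexicon, random.Random(0))
print(aug_tokens, aug_tags)
```

Because the surrounding context is left unchanged, each augmented sentence stays consistent with its labels while exposing the model to more surface forms per entity type.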
We compared our method with 5 baseline methods that use different BERT models as their feature extraction modules. Experimental results on the Chinese electronic health records that we collected demonstrate that our method achieved better performance (micro-precision: 98.7%, micro-recall: 99.13%, and micro-F1 score: 98.91%) and higher efficiency (40% faster) than all the BERT-based baseline methods.
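For readers unfamiliar with micro-averaging, the sketch below shows one common way to compute it for entity recognition: true positives, false positives, and false negatives are pooled across all sentences and entity types before precision, recall, and F1 are calculated. Exact span matching and the helper name `micro_prf` are assumptions here; the abstract does not state the matching criterion used.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def micro_prf(
    gold: Iterable[Set[Span]], pred: Iterable[Set[Span]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-sentence span sets.
    A predicted span counts as correct only on an exact (start, end, label) match."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical spans, invented for illustration.
gold = [{(1, 3, "NAME"), (4, 5, "DATE")}, {(0, 1, "ID")}]
pred = [{(1, 3, "NAME")}, {(0, 1, "ID"), (2, 3, "NAME")}]
print(micro_prf(gold, pred))
```

As a quick consistency check, the harmonic mean of the reported micro-precision (98.7%) and micro-recall (99.13%) is approximately 98.91%, which matches the reported micro-F1 score.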
Compared with the baseline methods, our approach retained the efficiency advantage of TinyBERT while improving performance on the task of Chinese protected health information deidentification when trained on our proposed augmented data set.