Zhang Yu, Wang Xuwen, Hou Zhen, Li Jiao
Institute of Medical Information and Library, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China.
JMIR Med Inform. 2018 Dec 17;6(4):e50. doi: 10.2196/medinform.9965.
BACKGROUND: Electronic health records (EHRs) are important data resources for clinical studies and applications. Physicians or clinicians describe patients' disorders or treatment procedures in EHRs using free text (unstructured) clinical notes. The narrative information plays an important role in patient treatment and clinical research. However, it is challenging to make machines understand the clinical narratives.

OBJECTIVE: This study aimed to automatically identify Chinese clinical entities from free text in EHRs and make machines semantically understand diagnoses, tests, body parts, symptoms, treatments, and so on.

METHODS: The dataset we used for this study is the benchmark dataset with human annotated Chinese EHRs, released by the China Conference on Knowledge Graph and Semantic Computing 2017 clinical named entity recognition challenge task. Overall, 2 machine learning models, the conditional random fields (CRF) method and bidirectional long short-term memory (LSTM)-CRF, were applied to recognize clinical entities from Chinese EHR data. To train the CRF-based model, we selected features such as bag of Chinese characters, part-of-speech tags, character types, and the position of characters. For the bidirectional LSTM-CRF-based model, character embeddings and segmentation information were used as features. In addition, we also employed a dictionary-based approach as the baseline for the purpose of performance evaluation. Precision, recall, and the harmonic average of precision and recall (F1 score) were used to evaluate the performance of the methods.

RESULTS: Experiments on the test set showed that our methods were able to automatically identify types of Chinese clinical entities such as diagnosis, test, symptom, body part, and treatment simultaneously. With regard to overall performance, CRF and bidirectional LSTM-CRF achieved a precision of 0.9203 and 0.9112, recall of 0.8709 and 0.8974, and F1 score of 0.8949 and 0.9043, respectively. The results also indicated that our methods performed well in recognizing each type of clinical entity, in which the "symptom" type achieved the best F1 score of over 0.96. Moreover, as the number of features increased, the F1 score of the CRF model increased from 0.8547 to 0.8949.

CONCLUSIONS: In this study, we employed two computational methods to simultaneously identify types of Chinese clinical entities from free text in EHRs. With training, these methods can effectively identify various types of clinical entities (eg, symptom and treatment) with high accuracy. The deep learning model, bidirectional LSTM-CRF, can achieve better performance than the CRF model with little feature engineering. This study contributed to translating human-readable health information into machine-readable information.
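
The abstract defines the F1 score as the harmonic average of precision and recall. As a minimal sketch of that definition, the snippet below recomputes F1 from the overall precision and recall figures reported in the abstract. The function name `f1_score` and the `reported` dictionary are illustrative choices, not taken from the paper. Because the published precision and recall are already rounded to four decimals, a recomputed value can differ from the published F1 in the last decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall (precision, recall, F1) as reported in the abstract.
# The dictionary layout is an assumption made for this illustration.
reported = {
    "CRF": (0.9203, 0.8709, 0.8949),
    "BiLSTM-CRF": (0.9112, 0.8974, 0.9043),
}

for model, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{model}: computed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```

The harmonic mean penalizes an imbalance between precision and recall, which is why the bidirectional LSTM-CRF model scores a higher F1 than the CRF model despite its slightly lower precision.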