Life Science College, Central South University, No. 932 South Lushan Road, Changsha, 410083, China.
Institute of Medical Information, Chinese Academy of Medical Sciences, No. 3 Yabao Road, Beijing, 100020, China.
BMC Med Inform Decis Mak. 2022 Mar 23;22(1):72. doi: 10.1186/s12911-022-01810-z.
Pituitary adenomas are the most common type of pituitary disorder. They usually occur in young adults and often affect the patient's physical development, labor capacity and fertility. Clinical free text in the electronic medical records (EMRs) of pituitary adenoma patients contains abundant diagnosis and treatment information. However, this information has not been well utilized because extracting information from unstructured clinical text is difficult. This study aims to enable machines to process this clinical information and to automatically extract clinical named entities for pituitary adenomas from Chinese EMRs.
The clinical corpus used in this study came from a pituitary adenoma neurosurgery treatment center at a 3A hospital in China. Four types of fine-grained clinical record text were selected: notes on present illness, past medical history, case characteristics and family history for 500 pituitary adenoma inpatients. Four methods were used to extract clinical entities from this Chinese EMR corpus: dictionary-based matching, conditional random fields (CRF), bidirectional long short-term memory with CRF (BiLSTM-CRF), and bidirectional encoder representations from transformers with BiLSTM-CRF (BERT-BiLSTM-CRF). For the dictionary-based matching method, a comprehensive dictionary was constructed from open source vocabularies and a domain dictionary for pituitary adenomas. Features such as part of speech, radical, document type and character position were selected to train the CRF-based model. Random character embeddings were used as the input features for the BiLSTM-CRF model, and character embeddings pretrained by BERT were used for the BERT-BiLSTM-CRF model. Both strict and relaxed metrics were used to evaluate the performance of these methods.
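The dictionary-based matching step can be illustrated with a minimal sketch. The abstract does not state which matching strategy was used, so this example assumes forward maximum matching (at each character position, take the longest dictionary entry that matches). The function name `dictionary_match`, the three-entry lexicon and the sample sentence are hypothetical and are not taken from the study's dictionary or corpus.

```python
def dictionary_match(text, lexicon):
    """Forward maximum matching (an assumed strategy, not confirmed by the study).

    lexicon maps a surface string to an entity type.
    Returns a list of (start, end, entity_type) spans, with end exclusive.
    """
    max_len = max((len(term) for term in lexicon), default=0)
    entities = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so "垂体腺瘤" wins over "垂体".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = (i, i + length, lexicon[candidate])
                break
        if match:
            entities.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return entities


# Hypothetical toy lexicon and sentence, for illustration only.
toy_lexicon = {"头痛": "symptom", "垂体腺瘤": "disease", "垂体": "body_region"}
toy_spans = dictionary_match("患者头痛,诊断为垂体腺瘤", toy_lexicon)
print(toy_spans)
```

Working at the character level avoids depending on a Chinese word segmenter, which is one plausible reason to prefer longest-match lookup here; the actual system may differ.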
Experimental results demonstrated that the deep learning and other machine learning methods were able to automatically extract clinical named entities for pituitary adenomas from Chinese EMRs, including symptoms, body regions, diseases, family histories, surgeries, medications and disease courses. In overall performance, BERT-BiLSTM-CRF achieved the highest strict F1 value (91.27%) and the highest relaxed F1 value (95.57%). Additional evaluations showed that BERT-BiLSTM-CRF performed best on almost all entity types except surgery and disease course. For disease course entities, BiLSTM-CRF performed best, tied with the CRF model using part-of-speech, radical and document type features; both strict and relaxed F1 values reached 96.48%. For surgery entities, the CRF model with part-of-speech, radical and document type features performed best, with a relaxed F1 value of 95.29%.
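To make the difference between the strict and relaxed F1 values concrete, here is a minimal sketch of one common way to compute them. The abstract does not define the two metrics, so this example assumes that a strict match requires identical boundaries and entity type, while a relaxed match requires only an overlapping span with the same entity type. The function name `f1_scores` and the toy spans are hypothetical.

```python
def f1_scores(gold, pred):
    """Strict and relaxed F1 over (start, end, entity_type) spans, end exclusive.

    Assumed definitions (not confirmed by the study):
      strict  = same boundaries and same type
      relaxed = overlapping span and same type
    """
    def overlaps(a, b):
        return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

    def f1(matched_pred, matched_gold):
        precision = matched_pred / len(pred) if pred else 0.0
        recall = matched_gold / len(gold) if gold else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    gold_set = set(gold)
    strict_hits = sum(1 for p in pred if p in gold_set)
    # For relaxed matching, count matched predictions and matched gold spans separately.
    relaxed_pred_hits = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    relaxed_gold_hits = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    return {
        "strict": f1(strict_hits, strict_hits),
        "relaxed": f1(relaxed_pred_hits, relaxed_gold_hits),
    }


# Hypothetical example: the second prediction has the right type but a short boundary.
gold_spans = [(2, 4, "symptom"), (8, 12, "disease")]
pred_spans = [(2, 4, "symptom"), (8, 10, "disease")]
scores = f1_scores(gold_spans, pred_spans)
print(scores)
```

Under these assumptions a boundary error lowers the strict score but not the relaxed one, which is consistent with the relaxed F1 values reported above being at least as high as the strict ones.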
In this study, we applied four entity recognition methods to Chinese EMRs of pituitary adenoma patients. The results demonstrate that deep learning methods can effectively extract various types of clinical entities with satisfactory performance. This study contributes to clinical named entity extraction from Chinese neurosurgical EMRs. The findings could also assist information extraction from other Chinese medical texts.