Song Gyuseon, Chung Su Jin, Seo Ji Yeon, Yang Sun Young, Jin Eun Hyo, Chung Goh Eun, Shim Sung Ryul, Sa Soonok, Hong Moongi Simon, Kim Kang Hyun, Jang Eunchan, Lee Chae Won, Bae Jung Ho, Han Hyun Wook
Department of Biomedical Informatics, CHA University School of Medicine, CHA University, Seongnam 13488, Korea.
Institute for Biomedical Informatics, CHA University School of Medicine, CHA University, Seongnam 13488, Korea.
J Clin Med. 2022 May 24;11(11):2967. doi: 10.3390/jcm11112967.
The utility of clinical information from esophagogastroduodenoscopy (EGD) reports has been limited because the reports are written in an unstructured narrative format. We developed a natural language processing (NLP) pipeline that automatically extracts information about gastric diseases from unstructured EGD reports and demonstrated its applicability in clinical research. The NLP pipeline was developed using 2000 EGD reports and their associated pathology reports, retrieved from a single healthcare center. The pipeline extracted clinical information, including presence, location, and size, for 10 gastric diseases from the EGD reports. It was validated on 1000 EGD reports by evaluating sensitivity, positive predictive value (PPV), accuracy, and F1 score. The pipeline was then applied to 248,966 EGD reports from 2010 to 2019 to identify patient demographics and clinical information for the 10 gastric diseases. For gastritis information extraction, the pipeline achieved an overall sensitivity, PPV, accuracy, and F1 score of 0.966, 0.972, 0.996, and 0.967, respectively. For other gastric diseases, such as ulcers and neoplastic diseases, it achieved an overall sensitivity, PPV, accuracy, and F1 score of 0.975, 0.982, 0.999, and 0.978, respectively. Analysis of the 10 years of EGD data revealed the demographics of patients with gastric diseases by sex and age. In addition, the study identified the extent of gastritis and the locations of other gastric diseases. We demonstrated the feasibility of using the NLP pipeline to automatically extract gastric disease information from EGD reports. Incorporating the pipeline can facilitate large-scale clinical research to better understand gastric diseases.
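The four validation metrics named in the abstract can be illustrated with a minimal sketch. It assumes that each extracted item (for example, the presence of one disease in one report) is scored against a manual reference standard as a true positive, false positive, false negative, or true negative. The `Counts` class, the `validation_metrics` function, and the example counts are hypothetical and are not taken from the study; how the authors aggregated scores across diseases is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one extraction target (hypothetical helper)."""
    tp: int  # item present in the reference standard and extracted
    fp: int  # item extracted but absent from the reference standard
    fn: int  # item present in the reference standard but missed
    tn: int  # item absent from both


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def validation_metrics(c: Counts) -> dict[str, float]:
    """Compute sensitivity, PPV, accuracy, and F1 from confusion counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)            # also called recall
    ppv = _ratio(c.tp, c.tp + c.fp)                    # also called precision
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)  # harmonic mean
    return {
        "sensitivity": sensitivity,
        "ppv": ppv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Illustrative counts only; these are not results from the study.
example = validation_metrics(Counts(tp=90, fp=10, fn=10, tn=890))
```

Because true negatives usually dominate when most reports do not mention a given disease, accuracy tends to sit closer to 1 than sensitivity or PPV, which is the same pattern seen in the figures reported above.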