Zheng Chengyi, Ackerson Bradley, Qiu Sijia, Sy Lina S, Daily Leticia I Vega, Song Jeannie, Qian Lei, Luo Yi, Ku Jennifer H, Cheng Yanjun, Wu Jun, Tseng Hung Fu
Department of Research & Evaluation, Kaiser Permanente Southern California, 100 S Los Robles Ave, 2nd Floor, Pasadena, CA, 91101, United States, 1 626-986-8665, 1 626-564-7872.
South Bay Medical Center, Kaiser Permanente Southern California, Harbor City, CA, United States.
JMIR Med Inform. 2024 Sep 10;12:e57949. doi: 10.2196/57949.
Algorithms that identify postherpetic neuralgia (PHN), a debilitating complication of herpes zoster (HZ), typically rely on diagnosis codes and prescription data. Because the accuracy of these data is questionable, PHN is sometimes identified instead by manual chart review of electronic health records (EHRs), which is costly and time-consuming.
This study aims to develop and validate a natural language processing (NLP) algorithm for automatically identifying PHN from unstructured EHR data and to compare its performance with that of code-based methods.
This retrospective study used EHR data from Kaiser Permanente Southern California, a large integrated health care system that serves over 4.8 million members. The source population included members aged ≥50 years who received an incident HZ diagnosis and accompanying antiviral prescription between 2018 and 2020 and had ≥1 encounter 90 to 180 days after the incident HZ diagnosis. The study team manually reviewed the EHR and identified PHN cases. For NLP development and validation, random samples of 500 and 800 members, respectively, were drawn from the source population. The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F-score, and Matthews correlation coefficient (MCC) of the NLP algorithm and the code-based methods were evaluated using chart-reviewed results as the reference standard.
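For reference, the six measures named above have standard definitions in terms of a 2×2 table built against the chart-review reference standard. The notation below (TP, FP, FN, TN for true positives, false positives, false negatives, and true negatives, and the weight β) is introduced here for illustration and does not come from the study. The abstract does not state which F-score weighting was used, so the general F_β form is shown.

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{PPV} &= \frac{TP}{TP + FP}, &
\text{NPV} &= \frac{TN}{TN + FN}, \\
F_\beta &= \frac{(1 + \beta^2)\, TP}{(1 + \beta^2)\, TP + \beta^2\, FN + FP}, \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                   {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\end{align*}
```

Setting β = 1 gives the familiar F1 score. Larger β weights sensitivity more heavily than PPV.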
The NLP algorithm identified PHN cases with a 90.9% sensitivity, 98.5% specificity, 82% PPV, and 99.3% NPV. The composite scores of the NLP algorithm were 0.89 (F-score) and 0.85 (MCC). The prevalences of PHN in the validation data were 6.9% (reference standard), 7.6% (NLP), and 5.4%-13.1% (code-based). The code-based methods achieved a 52.7%-61.8% sensitivity, 89.8%-98.4% specificity, 27.6%-72.1% PPV, and 96.3%-97.1% NPV. Their F-scores ranged from 0.45 to 0.59 and their MCCs from 0.32 to 0.61.
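As a rough consistency check on the NLP figures, the sketch below back-calculates one 2×2 table for the 800-member validation sample and recomputes each measure from it. The counts (50 true positives, 5 false negatives, 11 false positives) are an assumption reconstructed from the reported percentages. The abstract does not report them, and the function and variable names are hypothetical. Under these assumed counts, an F-score with β = 2 rounds to the reported 0.89, whereas F1 rounds to 0.86. This is only an observation and may not reflect how the study defined its F-score.

```python
from math import sqrt

N = 800  # validation sample size, as stated in the Methods

# ASSUMPTION: these counts are back-calculated for illustration only;
# the abstract reports percentages, not the underlying 2x2 table.
TP, FN, FP = 50, 5, 11
TN = N - (TP + FN + FP)  # remaining members are true negatives


def f_beta(tp, fp, fn, beta):
    """F-beta score from raw counts; beta > 1 favors sensitivity over PPV."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from raw counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


metrics = {
    "prevalence_reference_pct": 100 * (TP + FN) / N,
    "prevalence_nlp_pct": 100 * (TP + FP) / N,
    "sensitivity_pct": 100 * TP / (TP + FN),
    "specificity_pct": 100 * TN / (TN + FP),
    "ppv_pct": 100 * TP / (TP + FP),
    "npv_pct": 100 * TN / (TN + FN),
    "f1": f_beta(TP, FP, FN, beta=1),
    "f2": f_beta(TP, FP, FN, beta=2),
    "mcc": mcc(TP, FP, FN, TN),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Other integer tables may also round to the same percentages, so this reconstruction should be read as one plausible table rather than the study's actual data.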
The automated NLP-based approach identified PHN cases from the EHR with high sensitivity and specificity and outperformed the code-based methods on both composite measures. This method could be useful in population-based PHN research.