McMurry Andrew J, Phelan Dylan, Dixon Brian E, Geva Alon, Gottlieb Daniel, Jones James R, Terry Michael, Taylor David E, Callaway Hannah, Manoharan Sneha, Miller Timothy, Olson Karen L, Mandl Kenneth D
Computational Health Informatics Program, Boston Children's Hospital, 401 Park Drive, LM5506, Mail Stop BCH3187, Boston, MA, 02215, United States, 1 617-355-4145.
Department of Pediatrics, Harvard Medical School, Boston, MA, United States.
J Med Internet Res. 2025 Jul 31;27:e72984. doi: 10.2196/72984.
Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification.
The primary objective of this multisite study was to measure how accurately LLMs, instructed to follow chart review guidelines, identify infectious respiratory disease symptoms. The secondary objective was to evaluate LLM generalizability across sites without site-specific training, fine-tuning, or customization.
Four LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. Prompts instructed each LLM to take on the role of a chart reviewer and to follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children's Hospital. The performance of each LLM was measured using a test corpus of 202 notes from Boston Children's Hospital, with an International Classification of Diseases, Tenth Revision (ICD-10)-based method as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange.
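A minimal sketch of the role-based prompting strategy described above, in Python. The guideline text, symptom list, and response-parsing convention here are illustrative placeholders, not the study's actual annotation guidelines or prompt wording:

```python
# Illustrative sketch: frame an LLM as a chart reviewer that follows
# annotation guidelines, then parse its symptom labels.
# The symptom list and prompt wording are assumptions for illustration.

SYMPTOMS = ["fever", "cough", "congestion", "sore throat", "dyspnea"]


def build_messages(note_text: str, guideline: str) -> list[dict]:
    """Build chat messages casting the LLM in the chart reviewer role."""
    system = (
        "You are an expert chart reviewer. Follow these symptom "
        f"annotation guidelines exactly:\n{guideline}\n"
        "For each symptom, answer 'present' or 'absent' based only on "
        "what is documented in the note."
    )
    user = (
        f"Physician note:\n{note_text}\n\n"
        "Symptoms to assess: " + ", ".join(SYMPTOMS)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def parse_labels(reply: str) -> dict[str, bool]:
    """Parse 'symptom: present/absent' lines into boolean labels."""
    labels = {}
    for line in reply.splitlines():
        if ":" in line:
            name, verdict = line.split(":", 1)
            name = name.strip().lower()
            if name in SYMPTOMS:
                labels[name] = "present" in verdict.lower()
    return labels
```

The messages would be sent to each model's chat API; keeping the guideline text in the system prompt is what lets the same prompt be reused across sites without fine-tuning.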
Every LLM tested identified each infectious disease symptom more accurately than the ICD-10-based method (F1-score=45.1%). GPT-4 scored highest (F1-score=91.4%; P&lt;.001) and was significantly better than the ICD-10-based method, followed by GPT-3.5 (F1-score=90.0%; P&lt;.001), Mixtral (F1-score=83.5%; P&lt;.001), and Llama2 (F1-score=81.7%; P&lt;.001). On the validation corpus, performance of the ICD-10-based method decreased (F1-score=26.9%), while GPT-4's increased (F1-score=94.0%), demonstrating better generalizability for GPT-4 (P&lt;.001).
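The F1-scores compared above are the standard harmonic mean of precision and recall; a minimal reference implementation (the counts below are illustrative, not the study's confusion-matrix values):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example with made-up counts: 90 true positives, 10 false positives,
# 10 false negatives gives precision = recall = 0.9, so F1 = 0.9.
example = f1_score(tp=90, fp=10, fn=10)
```

Because F1 ignores true negatives, it is well suited to symptom identification, where most symptoms are absent from most notes.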
LLMs significantly outperformed an ICD-10-based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.