Large language models outperform traditional natural language processing methods in extracting patient-reported outcomes in IBD.

Authors

Patel Perseus V, Davis Conner, Ralbovsky Amariel, Tinoco Daniel, Williams Christopher Y K, Slatter Shadera, Naderalvojoud Behzad, Rosen Michael J, Hernandez-Boussard Tina, Rudrapatna Vivek

Affiliations

Department of Pediatrics, University of California San Francisco, San Francisco, CA.

Division of Pediatric Gastroenterology, Stanford University School of Medicine, Palo Alto, CA.

Publication information

medRxiv. 2024 Sep 6:2024.09.05.24313139. doi: 10.1101/2024.09.05.24313139.

Abstract

BACKGROUND AND AIMS

Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manually extracting PROs from the free text of clinical notes is burdensome. We aimed to improve the curation of free-text data in the electronic health record, making it more readily available for research and quality improvement. To that end, we compared traditional natural language processing (tNLP) and large language models (LLMs) in extracting three IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across two institutions.

METHODS

Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California San Francisco (UCSF) and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. We also conducted fairness and error assessments.
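
As an illustration of the five comparison metrics named above, the following is a minimal sketch of how they can be derived from binary annotations (PRO present or absent) and model predictions. It is not the authors' evaluation code; the names `BinaryMetrics` and `binary_metrics`, and the 0/1 label encoding, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    """Standard confusion-matrix metrics for one PRO (hypothetical container)."""
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined
    # (for example, PPV when the model never predicts "present").
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred) -> BinaryMetrics:
    """Compare annotator labels with model predictions.

    Both inputs are equal-length sequences of 0/1 values, where 1 means
    the PRO (e.g. abdominal pain) was judged present in the note.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    return BinaryMetrics(
        accuracy=_ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=_ratio(tp, tp + fn),   # recall on notes where the PRO is present
        specificity=_ratio(tn, tn + fp),   # recall on notes where the PRO is absent
        ppv=_ratio(tp, tp + fp),           # positive predictive value
        npv=_ratio(tn, tn + fn),           # negative predictive value
    )


# Toy labels for illustration only; these are not data from the study.
example = binary_metrics(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
)
```

Computing the same metrics separately for each PRO and for each site (internal test set versus external validation set) would allow the kind of cross-institution comparison the study describes.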

RESULTS

Inter-rater reliability between annotators was >90%. On the UCSF test set (n=50), the top-performing tNLP models achieved accuracies of 92% (abdominal pain), 82% (diarrhea), and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n=250), tNLP models failed to generalize (61-62% accuracy), while GPT-4 maintained accuracies >90%. PaLM-2 and GPT-4 performed similarly. No biases were detected based on demographics or diagnosis.

CONCLUSIONS

LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/50ae/11398594/68244ffc22db/nihpp-2024.09.05.24313139v1-f0001.jpg
