
Development and validation of the provider documentation summarization quality instrument for large language models.

Author Information

Croxford Emma, Gao Yanjun, Pellegrino Nicholas, Wong Karen, Wills Graham, First Elliot, Schnier Miranda, Burton Kyle, Ebby Cris, Gorski Jillian, Kalscheur Matthew, Khalil Samy, Pisani Marie, Rubeor Tyler, Stetson Peter, Liao Frank, Goswami Cherodeep, Patterson Brian, Afshar Majid

Affiliations

Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792, United States.

Department of Biomedical Informatics, University of Colorado-Anschutz Medical, Aurora, CO 80045, United States.

Publication Information

J Am Med Inform Assoc. 2025 Jun 1;32(6):1050-1060. doi: 10.1093/jamia/ocaf068.

DOI: 10.1093/jamia/ocaf068
PMID: 40323321
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12089781/
Abstract

OBJECTIVES

As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.

MATERIALS AND METHODS

Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach's α for structural validity, inter-rater reliability (ICC and Krippendorff's α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.
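
The reliability analyses named here (ICC, Krippendorff's α, Cronbach's α) are standard psychometric computations. Purely as an orientation for readers, the sketch below shows how such statistics might be computed on a hypothetical long-format table of rater scores, assuming the Python packages pandas, pingouin, and krippendorff; it is not the authors' analysis code, and all data in it are invented.

```python
# Minimal sketch (assumption): reliability statistics of the kind used to validate
# the PDSQI-9, computed on a hypothetical long-format table of rater scores.
# Illustrative only; not the authors' analysis code.
import pandas as pd
import pingouin as pg
import krippendorff

# Invented data: one row per (summary, rater) with a 1-5 score on a single item.
ratings = pd.DataFrame({
    "summary_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":      ["A", "B", "C"] * 3,
    "score":      [4, 4, 5, 2, 3, 2, 5, 4, 5],
})

# Inter-rater reliability: intraclass correlation coefficients (ICC).
icc = pg.intraclass_corr(data=ratings, targets="summary_id",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Krippendorff's alpha on the same scores, treated as ordinal
# (rows = raters, columns = summaries).
wide = ratings.pivot(index="rater", columns="summary_id", values="score")
alpha_k = krippendorff.alpha(reliability_data=wide.to_numpy(),
                             level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha_k:.3f}")

# Internal consistency (Cronbach's alpha) would be computed on a summaries-by-items
# matrix of instrument scores, e.g. pg.cronbach_alpha(data=item_scores_wide).
```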

RESULTS

Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = -0.200, P = .029) and Organized (ρ = -0.190, P = .037). The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P<.001).
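
As a rough illustration of the analyses summarized above (not the authors' code), the sketch below runs a four-factor exploratory factor analysis and a note-length-versus-score Spearman correlation on invented data, assuming the factor_analyzer, NumPy, and SciPy packages.

```python
# Minimal sketch (assumption): a 4-factor exploratory factor analysis and a
# length-vs-score correlation of the kind reported above, on invented data.
# Illustrative only; not the authors' analysis code.
import numpy as np
from factor_analyzer import FactorAnalyzer
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_summaries, n_items = 779, 9
item_scores = rng.integers(1, 6, size=(n_summaries, n_items)).astype(float)  # placeholder 1-5 item scores
note_lengths = rng.integers(500, 5000, size=n_summaries)                      # placeholder source-note lengths

# Four-factor model with varimax rotation; report variance explained.
fa = FactorAnalyzer(n_factors=4, rotation="varimax")
fa.fit(item_scores)
_, prop_var, cum_var = fa.get_factor_variance()
print("Variance explained per factor:", np.round(prop_var, 3))
print("Cumulative variance explained:", round(cum_var[-1], 3))  # the paper reports ~58%

# Substantive validity: correlation of source-note length with one item's scores
# (e.g., a hypothetical 'Succinct' item in column 0).
rho, p = spearmanr(note_lengths, item_scores[:, 0])
print(f"Spearman rho = {rho:.3f}, P = {p:.3f}")
```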

DISCUSSION

The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.

CONCLUSIONS

The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.

Similar Articles

1. Development and validation of the provider documentation summarization quality instrument for large language models. J Am Med Inform Assoc. 2025 Jun 1;32(6):1050-1060. doi: 10.1093/jamia/ocaf068.
2. A dataset and benchmark for hospital course summarization with adapted large language models. J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
3. Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study. J Med Internet Res. 2025 Jul 2;27:e65226. doi: 10.2196/65226.
4. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
5. The measurement of collaboration within healthcare settings: a systematic review of measurement properties of instruments. JBI Database System Rev Implement Rep. 2016 Apr;14(4):138-97. doi: 10.11124/JBISRIR-2016-2159.
6. A comparative study of recent large language models on generating hospital discharge summaries for lung cancer patients. J Biomed Inform. 2025 Aug;168:104867. doi: 10.1016/j.jbi.2025.104867. Epub 2025 Jun 20.
7. Utilizing large language models for detecting hospital-acquired conditions: an empirical study on pulmonary embolism. J Am Med Inform Assoc. 2025 May 1;32(5):876-884. doi: 10.1093/jamia/ocaf048.
8. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.
9. Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report. J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.
10. Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese. JMIR Med Inform. 2025 Jul 8;13:e76773. doi: 10.2196/76773.

Cited By

1. A Randomized-Clinical Trial of Two Ambient Artificial Intelligence Scribes: Measuring Documentation Efficiency and Physician Burnout. medRxiv. 2025 Jul 11:2025.07.10.25331333. doi: 10.1101/2025.07.10.25331333.
2. Verifiable Summarization of Electronic Health Records Using Large Language Models to Support Chart Review. medRxiv. 2025 Jun 3:2025.06.02.25328807. doi: 10.1101/2025.06.02.25328807.
3. Harnessing the power of large language models for clinical tasks and synthesis of scientific literature. J Am Med Inform Assoc. 2025 Jun 1;32(6):983-984. doi: 10.1093/jamia/ocaf071.
4. Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge. medRxiv. 2025 May 6:2025.04.22.25326219. doi: 10.1101/2025.04.22.25326219.

References

1. Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses. AMIA Annu Symp Proc. 2025 May 22;2024:309-318. eCollection 2024.
2. A strategy for cost-effective large language model use at health system-scale. NPJ Digit Med. 2024 Nov 18;7(1):320. doi: 10.1038/s41746-024-01315-1.
3. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025 Jan 28;333(4):319-328. doi: 10.1001/jama.2024.21700.
4. A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization. Proc Mach Learn Res. 2023 Aug;219:2-30.
5. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit Med. 2024 Sep 28;7(1):258. doi: 10.1038/s41746-024-01258-7.
6. Toward Clinical Generative AI: Conceptual Framework. JMIR AI. 2024 Jun 7;3:e55957. doi: 10.2196/55957.
7. Effect of Ambient Voice Technology, Natural Language Processing, and Artificial Intelligence on the Patient-Physician Relationship. Appl Clin Inform. 2024 Aug;15(4):660-667. doi: 10.1055/a-2337-4739. Epub 2024 Jun 4.
8. Call me Dr Ishmael: trends in electronic health record notes available at emergency department visits and admissions. JAMIA Open. 2024 May 22;7(2):ooae039. doi: 10.1093/jamiaopen/ooae039. eCollection 2024 Jul.
9. Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study. J Med Internet Res. 2024 Apr 22;26:e54419. doi: 10.2196/54419.
10. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.