Croxford Emma, Gao Yanjun, Pellegrino Nicholas, Wong Karen, Wills Graham, First Elliot, Schnier Miranda, Burton Kyle, Ebby Cris, Gorski Jillian, Kalscheur Matthew, Khalil Samy, Pisani Marie, Rubeor Tyler, Stetson Peter, Liao Frank, Goswami Cherodeep, Patterson Brian, Afshar Majid
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792, United States.
Department of Biomedical Informatics, University of Colorado-Anschutz Medical, Aurora, CO 80045, United States.
J Am Med Inform Assoc. 2025 Jun 1;32(6):1050-1060. doi: 10.1093/jamia/ocaf068.
As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.
Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach's α for structural validity, inter-rater reliability (ICC and Krippendorff's α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.
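As an illustration of one of the reliability statistics named above (not the authors' analysis code), Cronbach's α for a subjects-by-items score matrix can be sketched as follows; the function name and the toy data are hypothetical:

```python
import numpy as np

def cronbach_alpha(scores) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_vars = scores.var(axis=0, ddof=1).sum()     # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of row totals
    return k / (k - 1) * (1 - item_vars / total_var)
```

When every item is perfectly correlated, the item variances sum to a fraction 1/k of the total-score variance and α equals 1; independent items drive α toward 0.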
Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = -0.200, P = .029) and Organized (ρ = -0.190, P = .037). The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P < .001).
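The inter-rater reliability statistic reported above, a two-way ICC, can be sketched from its ANOVA decomposition; this is a minimal single-rating ICC(2,1) illustration under the assumption of a complete subjects-by-raters matrix, not the authors' implementation:

```python
import numpy as np

def icc2_1(ratings) -> float:
    """Two-way random-effects, single-rating ICC(2,1) for an
    (n_subjects, k_raters) matrix with no missing cells."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # between-subject SS
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # between-rater SS
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols # residual SS
    ms_r = ss_rows / (n - 1)                              # subject mean square
    ms_c = ss_cols / (k - 1)                              # rater mean square
    ms_e = ss_err / ((n - 1) * (k - 1))                   # error mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Perfect agreement (identical columns) yields an ICC of 1; rater disagreement inflates the error and rater mean squares and pulls the coefficient down.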
The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.
The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.