Schoonbeek Rosanne C, Workum Jessica D, Schuit Stephanie C E, Hoekman Anne H, Mehri Tarannom, Doornberg Job N, van der Laan Tom P, Bootsma-Robroeks Charlotte M H H T
Department of Otolaryngology - Head and Neck Surgery, University Medical Centre Groningen, Groningen, The Netherlands.
Department of Medical Information Technology, University Medical Centre Groningen, Groningen, The Netherlands.
BMJ Open. 2025 Sep 4;15(9):e099301. doi: 10.1136/bmjopen-2025-099301.
To compare the quality and time efficiency of physician-written summaries with customised large language model (LLM)-generated medical summaries integrated into the electronic health record (EHR) in a non-English clinical environment.
Cross-sectional non-inferiority validation study.
Tertiary academic hospital.
52 physicians from 8 specialties at a large Dutch academic hospital participated, either in writing summaries (n=42) or evaluating them (n=10).
Physician writers wrote summaries of 50 patient records. LLM-generated summaries were created for the same records using an EHR-integrated LLM. An independent, blinded panel of physician evaluators compared physician-written summaries to LLM-generated summaries.
Primary outcome measures were completeness, correctness and conciseness (on a 5-point Likert scale). Secondary outcomes were preference and trust, and time to generate either the physician-written or LLM-generated summary.
The completeness and correctness of LLM-generated summaries did not differ significantly from those of physician-written summaries. However, LLM-generated summaries were less concise (3.0 vs 3.5, p=0.001). Overall evaluation scores were similar (3.4 vs 3.3, p=0.373), with 57% of evaluators preferring LLM-generated summaries. Trust in both summary types was comparable, and interobserver reliability was excellent (intraclass correlation coefficient 0.975). Physicians took an average of 7 min per summary, whereas the LLM generated a summary in 15.7 s.
LLM-generated summaries are comparable to physician-written summaries in completeness and correctness, although slightly less concise. With a clear time-saving benefit, LLMs could help reduce clinicians' administrative burden without compromising summary quality.
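To put the reported time difference in perspective, the arithmetic can be sketched as follows. This is only an illustration under stated assumptions: the averages of 7 min and 15.7 s are treated as exact point values, the 50 patient records are used as the workload, and any time a physician would spend reviewing an LLM-generated summary is not counted. The variable names are hypothetical and not taken from the study.

```python
# Reported averages from the abstract, treated here as exact point values (assumption).
physician_seconds = 7 * 60   # 7 min per physician-written summary
llm_seconds = 15.7           # 15.7 s per LLM-generated summary
records = 50                 # number of patient records summarised in the study

# Ratio of physician time to LLM time per summary.
speedup = physician_seconds / llm_seconds

# Generation time saved per summary and across all records.
# Review time for LLM output is deliberately ignored in this sketch.
saved_per_summary = physician_seconds - llm_seconds
saved_total_minutes = saved_per_summary * records / 60

print(f"Speed-up factor: {speedup:.1f}x")
print(f"Time saved per summary: {saved_per_summary:.1f} s")
print(f"Time saved over {records} records: {saved_total_minutes:.0f} min")
```

Under these assumptions, generation is roughly 27 times faster with the LLM. The real-world saving would be smaller once physician review of each generated summary is included.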