Ha Emily, Choon-Kon-Yune Isabelle, Murray LaShawn, Luan Siying, Montague Enid, Bhattacharyya Onil, Agarwal Payal
Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
Women's College Hospital Institute for Health System Solutions and Virtual Care, Women's College Hospital, 76 Grenville Street, 6th Floor, Toronto, ON M5S 1B2, Canada. Phone: 1 416 323 6400.
JMIR Hum Factors. 2025 Jul 23;12:e71434. doi: 10.2196/71434.
BACKGROUND: Primary care providers (PCPs) face significant burnout due to increasing administrative and documentation demands, contributing to job dissatisfaction and impacting care quality. Artificial intelligence (AI) scribes have emerged as potential solutions to reduce administrative burden by automating clinical documentation of patient encounters. Although AI scribes are gaining popularity in primary care, there is limited information on their usability, effectiveness, and accuracy.

OBJECTIVE: This study aimed to develop and apply an evaluation framework to systematically assess the usability, technical performance, and accuracy of various AI scribes used in primary care settings across Canada and the United States.

METHODS: We conducted a systematic comparison of a suite of AI scribes using competitive analysis methods. An evaluation framework was developed using expert usability approaches and human factors engineering principles and comprises 3 domains: usability, effectiveness and technical performance, and accuracy and quality. Audio files from 4 standardized patient encounters were used to generate transcripts and SOAP (Subjective, Objective, Assessment, and Plan)-format medical notes from each AI scribe. A verbatim transcript, detailed case notes, and physician-written medical notes for each audio file served as benchmarks for comparison against the AI-generated outputs. Applicable items were rated on a 3-point Likert scale (1=poor, 2=good, 3=excellent). Additional insights were gathered from clinical experts, vendor questionnaires, and public resources to support usability, effectiveness, and quality findings.

RESULTS: In total, 6 AI scribes were evaluated, with notable performance differences. Most AI scribes could be accessed via various platforms (n=4) and launched within common electronic medical records, though data exchange capabilities were limited. Nearly all AI scribes generated SOAP-format notes in approximately 1 minute for a 15-minute standardized encounter (n=5), though documentation time increased with encounter length and topic complexity. While all AI scribes produced good to excellent quality medical notes, none were consistently error-free. Common errors included deletion, omission, and SOAP structure errors. Factors such as extraneous conversations and multiple speakers impacted the accuracy of both the transcript and the medical note; some AI scribes produced excellent notes despite minor transcript issues, and vice versa. Limitations in usability, technical performance, and accuracy suggest areas for improvement to fully realize AI scribes' potential in reducing administrative burden for PCPs.

CONCLUSIONS: This study offers one of the first systematic evaluations of the usability, effectiveness, and accuracy of a suite of AI scribes currently used in primary care, providing benchmark data for further research, policy, and practice. While AI scribes show promise in reducing documentation burdens, improvements and ongoing evaluations are essential to ensure safe and effective use. Future studies should assess AI scribe performance in real-world settings across diverse populations to support equitable and reliable applications.