
Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study.

Affiliation

Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Sciences University, Portland, OR, United States.

Publication

J Med Internet Res. 2024 Apr 22;26:e54419. doi: 10.2196/54419.

DOI: 10.2196/54419
PMID: 38648636
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11074889/
Abstract

BACKGROUND

Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)-powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows.

OBJECTIVE

This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model's performance across different categories.

METHODS

We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system.
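The chart-review comparison described above — scoring each generated note's data elements against the gold-standard transcript and classifying discrepancies as omissions, incorrect information, or additions — can be sketched in a few lines. This is a minimal illustration, not the study's actual scoring code; the element names and dictionary representation are invented for the example:

```python
# Hypothetical sketch of the error-categorization step: compare the data
# elements of a gold-standard transcript against those found in an
# AI-generated SOAP note. Element names are invented for illustration.

def categorize_errors(gold: dict, generated: dict) -> dict:
    """Tally omissions, incorrect facts, additions, and correct elements."""
    tally = {"omission": 0, "incorrect": 0, "addition": 0, "correct": 0}
    for key, value in gold.items():
        if key not in generated:
            tally["omission"] += 1     # element missing from the note
        elif generated[key] != value:
            tally["incorrect"] += 1    # element present but wrong
        else:
            tally["correct"] += 1
    for key in generated:
        if key not in gold:
            tally["addition"] += 1     # element not supported by the transcript
    return tally

gold = {"chief_complaint": "chest pain", "bp": "140/90", "plan": "ECG"}
note = {"chief_complaint": "chest pain", "bp": "150/90", "meds": "aspirin"}
print(categorize_errors(gold, note))
# {'omission': 1, 'incorrect': 1, 'addition': 1, 'correct': 1}
```

A per-case error count like the study's would then be the sum of the three error categories across all scorable elements in that case.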

RESULTS

Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05).
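The inverse correlation between accuracy and transcript length reported above can be illustrated with a small sketch. The numbers below are invented for demonstration (they are not the study's data), and Pearson's r is computed from scratch to keep the example self-contained:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-case values: longer transcripts, lower note accuracy.
transcript_words = [600, 900, 1200, 1500, 1800]
accuracy = [0.80, 0.72, 0.65, 0.58, 0.50]

print(pearson_r(transcript_words, accuracy))  # strongly negative (near -1)
```

A negative r of this kind, with an associated P value at the significance threshold, is what the study reports for both transcript length and the number of scorable data elements.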

CONCLUSIONS

Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model's effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.

Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a8c/11074889/1cc4add5b8ee/jmir_v26i1e54419_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a8c/11074889/c72764527820/jmir_v26i1e54419_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a8c/11074889/34167861e93c/jmir_v26i1e54419_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a8c/11074889/647bf4c09c58/jmir_v26i1e54419_fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a8c/11074889/23711e70273d/jmir_v26i1e54419_fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a8c/11074889/f1a61207bf17/jmir_v26i1e54419_fig6.jpg

Similar Articles

1. Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study.
J Med Internet Res. 2024 Apr 22;26:e54419. doi: 10.2196/54419.
2. Evaluating the Usability, Technical Performance, and Accuracy of Artificial Intelligence Scribes for Primary Care: Competitive Analysis.
JMIR Hum Factors. 2025 Jul 23;12:e71434. doi: 10.2196/71434.
3. Prescription of Controlled Substances: Benefits and Risks.
4. AI Scribes in Health Care: Balancing Transformative Potential With Responsible Integration.
JMIR Med Inform. 2025 Aug 1;13:e80898. doi: 10.2196/80898.
5. Navigating the future of pediatric cardiovascular surgery: Insights and innovation powered by Chat Generative Pre-Trained Transformer (ChatGPT).
J Thorac Cardiovasc Surg. 2025 Feb 1. doi: 10.1016/j.jtcvs.2025.01.022.
6. The educational effects of portfolios on undergraduate student learning: a Best Evidence Medical Education (BEME) systematic review. BEME Guide No. 11.
Med Teach. 2009 Apr;31(4):282-98. doi: 10.1080/01421590902889897.
7. AI in Medical Questionnaires: Innovations, Diagnosis, and Implications.
J Med Internet Res. 2025 Jun 23;27:e72398. doi: 10.2196/72398.
8. Sexual Harassment and Prevention Training.
9. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
10. Documenting Care with AI: A Comparative Analysis of Commercial Scribe Tools.
Stud Health Technol Inform. 2025 Aug 7;329:337-341. doi: 10.3233/SHTI250857.

Cited By

1. The impact of an artificial intelligence enhancement program on healthcare providers' knowledge, attitudes, and workplace flourishing.
Front Public Health. 2025 Aug 7;13:1639333. doi: 10.3389/fpubh.2025.1639333. eCollection 2025.
2. Transforming Cancer Care: A Narrative Review on Leveraging Artificial Intelligence to Advance Immunotherapy in Underserved Communities.
J Clin Med. 2025 Jul 29;14(15):5346. doi: 10.3390/jcm14155346.
3. AI Scribes in Health Care: Balancing Transformative Potential With Responsible Integration.
JMIR Med Inform. 2025 Aug 1;13:e80898. doi: 10.2196/80898.
4. Evaluating the Usability, Technical Performance, and Accuracy of Artificial Intelligence Scribes for Primary Care: Competitive Analysis.
JMIR Hum Factors. 2025 Jul 23;12:e71434. doi: 10.2196/71434.
5. General practitioners' opinions of generative artificial intelligence in the UK: An online survey.
Digit Health. 2025 Jul 17;11:20552076251360863. doi: 10.1177/20552076251360863. eCollection 2025 Jan-Dec.
6. Impact of artificial intelligence on electronic health record-related burnouts among healthcare professionals: systematic review.
Front Public Health. 2025 Jul 3;13:1628831. doi: 10.3389/fpubh.2025.1628831. eCollection 2025.
7. The Impact of AI Scribes on Streamlining Clinical Documentation: A Systematic Review.
Healthcare (Basel). 2025 Jun 16;13(12):1447. doi: 10.3390/healthcare13121447.
8. Improving Patient Communication by Simplifying AI-Generated Dental Radiology Reports With ChatGPT: Comparative Study.
J Med Internet Res. 2025 Jun 9;27:e73337. doi: 10.2196/73337.
9. Artificial intelligence-driven natural language processing for identifying linguistic patterns in Alzheimer's disease and mild cognitive impairment: A study of lexical, syntactic, and cohesive features of speech through picture description tasks.
J Alzheimers Dis. 2025 Jul;106(1):120-138. doi: 10.1177/13872877251339756. Epub 2025 May 7.
10. Development and validation of the provider documentation summarization quality instrument for large language models.
J Am Med Inform Assoc. 2025 Jun 1;32(6):1050-1060. doi: 10.1093/jamia/ocaf068.

References

1. Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument.
J Med Internet Res. 2023 Jun 30;25:e47479. doi: 10.2196/47479.
2. ChatGPT is not the solution to physicians' documentation burden.
Nat Med. 2023 Jun;29(6):1296-1297. doi: 10.1038/s41591-023-02341-4.
3. What if your patient switches from Dr. Google to Dr. ChatGPT? A vignette-based survey of the trustworthiness, value, and danger of ChatGPT-generated responses to health questions.
Eur J Cardiovasc Nurs. 2024 Jan 12;23(1):95-98. doi: 10.1093/eurjcn/zvad038.
4. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study.
Int J Environ Res Public Health. 2023 Feb 15;20(4):3378. doi: 10.3390/ijerph20043378.
5. Artificial intelligence bot ChatGPT in medical research: the potential game changer as a double-edged sword.
Knee Surg Sports Traumatol Arthrosc. 2023 Apr;31(4):1187-1189. doi: 10.1007/s00167-023-07355-6. Epub 2023 Feb 21.
6. Medical Record Closure Practices of Physicians Before and After the Use of Medical Scribes.
JAMA. 2022 Oct 4;328(13):1350-1352. doi: 10.1001/jama.2022.13558.
7. Medical Documentation Burden Among US Office-Based Physicians in 2019: A National Study.
JAMA Intern Med. 2022 May 1;182(5):564-566. doi: 10.1001/jamainternmed.2022.0372.
8. Comparing Scribed and Non-scribed Outpatient Progress Notes.
AMIA Annu Symp Proc. 2022 Feb 21;2021:1059-1068. eCollection 2021.
9. Chart Completion Time of Attending Physicians While Using Medical Scribes.
AMIA Annu Symp Proc. 2022 Feb 21;2021:457-465. eCollection 2021.
10. The future of medical scribes documenting in the electronic health record: results of an expert consensus conference.
BMC Med Inform Decis Mak. 2021 Jun 29;21(1):204. doi: 10.1186/s12911-021-01560-4.