

Development and Validation of a Large Language Model-Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency.

Author Information

Liu Yang, Shi Chujun, Wu Liping, Lin Xiule, Chen Xiaoqin, Zhu Yiying, Tan Haizhu, Zhang Weishan

Affiliations

Medical Simulation Center, Shantou University Medical College, No. 22 Xinling Road, Shantou, 515041, China, 86 754-88900459.

Department of Medical Physics and Informatics, Shantou University Medical College, Shantou, China.

Publication Information

JMIR Med Educ. 2025 Aug 29;11:e73419. doi: 10.2196/73419.


DOI: 10.2196/73419
PMID: 40882613
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12396829/
Abstract

BACKGROUND: History-taking is crucial in medical training. However, current methods often lack consistent feedback and standardized evaluation and have limited access to standardized patient (SP) resources. Artificial intelligence (AI)-powered simulated patients offer a promising solution; however, challenges such as human-AI consistency, evaluation stability, and transparency remain underexplored in multicase clinical scenarios.

OBJECTIVE: This study aimed to develop and validate the AI-Powered Medical History-Taking Training and Evaluation System (AMTES), based on DeepSeek-V2.5 (DeepSeek), and to assess its stability, human-AI consistency, and transparency in clinical scenarios with varying symptoms and difficulty levels.

METHODS: We developed AMTES, a system that uses multiple strategies to ensure dialog quality and automated assessment. A prospective study with 31 medical students evaluated AMTES's performance across 3 cases of varying complexity: a simple case (cough), a moderate case (frequent urination), and a complex case (abdominal pain). To validate our design, we conducted systematic baseline comparisons to measure the incremental improvement from each level of our design approach and tested the framework's generalizability by implementing it with an alternative large language model (LLM), Qwen-Max (Qwen AI; version 20250409), under a zero-modification condition.

RESULTS: A total of 31 students practiced with AMTES, generating 8606 questions across 93 history-taking sessions. AMTES achieved high dialog accuracy: 98.6% (SD 1.5%) for cough, 99.0% (SD 1.1%) for frequent urination, and 97.9% (SD 2.2%) for abdominal pain, with contextual appropriateness exceeding 99%. The system's automated assessments demonstrated exceptional stability and high human-AI consistency, supported by transparent, evidence-based rationales. Specifically, the coefficients of variation (CVs) were low for both total scores (0.87%-1.12%) and item-level scoring (0.55%-0.73%). Total score consistency was robust, with intraclass correlation coefficients (ICCs) exceeding 0.923 across all scenarios, indicating strong agreement. Item-level consistency was remarkably high, consistently above 95%, even for complex cases such as abdominal pain (95.75% consistency). In systematic baseline comparisons, the fully processed system improved ICCs from 0.414/0.500 to 0.923/0.972 (moderate and complex cases), with all CVs ≤1.2% across the 3 cases. A zero-modification implementation of the evaluation framework with an alternative LLM (Qwen-Max) achieved near-identical performance, with item-level consistency rates over 94.5% and ICCs exceeding 0.89. Overall, 87% of students found AMTES helpful, and 83% expressed a desire to use it again.

CONCLUSIONS: Our data show that AMTES delivers significant educational value through its LLM-based virtual SPs, which provided authentic clinical dialogs with high response accuracy and consistent, transparent educational feedback. Combined with strong user approval, these findings highlight AMTES's potential as a valuable, adaptable, and generalizable tool for medical history-taking training across various educational contexts.
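The abstract reports two stability metrics, the coefficient of variation (CV) across repeated automated scorings and the intraclass correlation coefficient (ICC) between AI and human raters, without specifying the exact computation. A minimal sketch, assuming the common two-way random-effects single-rater form ICC(2,1) and a sample-based CV, might look like:

```python
import numpy as np

def cv_percent(x):
    """Coefficient of variation of repeated scores, in percent (sample SD / mean)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean() * 100

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_subjects, k_raters) matrix, e.g. rows = students,
    columns = AI score and human score for the same session.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-rater means
    # Two-way ANOVA mean squares
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between raters
    sse = ((scores - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Identical ratings across raters yield an ICC of 1.0, and repeated scorings with zero spread yield a CV of 0%; the paper's low CVs (≤1.2%) and ICCs above 0.92 correspond to near-perfect stability and agreement under these definitions.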


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/31ac/12396829/fb906890e6bd/mededu-v11-e73419-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/31ac/12396829/1d3970ebb456/mededu-v11-e73419-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/31ac/12396829/28112587b3f1/mededu-v11-e73419-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/31ac/12396829/407dc3be4b29/mededu-v11-e73419-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/31ac/12396829/43a7a6fd3b7b/mededu-v11-e73419-g005.jpg

Similar Articles

[1]
Development and Validation of a Large Language Model-Based System for Medical History-Taking Training: Prospective Multicase Study on Evaluation Stability, Human-AI Consistency, and Transparency.

JMIR Med Educ. 2025-8-29

[2]
Prescription of Controlled Substances: Benefits and Risks

2025-1

[3]
User Intent to Use DeepSeek for Health Care Purposes and Their Trust in the Large Language Model: Multinational Survey Study.

JMIR Hum Factors. 2025-5-26

[4]
Development and Validation of a Large Language Model-Powered Chatbot for Neurosurgery: Mixed Methods Study on Enhancing Perioperative Patient Education.

J Med Internet Res. 2025-7-15

[5]
Using a Diverse Test Suite to Assess Large Language Models on Fast Health Care Interoperability Resources Knowledge: Comparative Analysis.

J Med Internet Res. 2025-8-12

[6]
Utility of Generative Artificial Intelligence for Japanese Medical Interview Training: Randomized Crossover Pilot Study.

JMIR Med Educ. 2025-8-1

[7]
Development of a GPT-4-Powered Virtual Simulated Patient and Communication Training Platform for Medical Students to Practice Discussing Abnormal Mammogram Results With Patients: Multiphase Study.

JMIR Form Res. 2025-4-17

[8]
AI in Medical Questionnaires: Innovations, Diagnosis, and Implications.

J Med Internet Res. 2025-6-23

[9]
A New Measure of Quantified Social Health Is Associated With Levels of Discomfort, Capability, and Mental and General Health Among Patients Seeking Musculoskeletal Specialty Care.

Clin Orthop Relat Res. 2025-4-1

[10]
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.

Cochrane Database Syst Rev. 2022-5-20

