Comparative Analysis of Generative Artificial Intelligence Systems in Solving Clinical Pharmacy Problems: Mixed Methods Study.

Authors

Li Lulu, Du Pengqiang, Huang Xiaojing, Zhao Hongwei, Ni Ming, Yan Meng, Wang Aifeng

Affiliation

Department of Pharmacy, Fuwai Central China Cardiovascular Hospital, 1 Fuwai Road, Zhengdong New District, Zhengzhou, China, 86 18538298379.

Publication information

JMIR Med Inform. 2025 Jul 24;13:e76128. doi: 10.2196/76128.

Abstract

BACKGROUND

Generative artificial intelligence (AI) systems are increasingly deployed in clinical pharmacy; yet, systematic evaluation of their efficacy, limitations, and risks across diverse practice scenarios remains limited.

OBJECTIVE

This study aims to quantitatively evaluate and compare the performance of 8 mainstream generative AI systems across 4 core clinical pharmacy scenarios (medication consultation, medication education, prescription review, and case analysis with pharmaceutical care) using a multidimensional framework.

METHODS

Forty-eight clinically validated questions were selected via stratified sampling from real-world sources (eg, hospital consultations, clinical case banks, and national pharmacist training databases). Three researchers simultaneously tested 8 generative AI systems (ERNIE Bot, Doubao, Kimi, Qwen, GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet, and DeepSeek-R1) using standardized prompts within a single day (February 20, 2025). A double-blind scoring design was used: 6 experienced clinical pharmacists (≥5 years of experience) evaluated the AI responses across 6 dimensions (accuracy, rigor, applicability, logical coherence, conciseness, and universality), each scored from 0 to 10 against predefined criteria with fixed deductions (eg, 3 points for inaccuracy and 2 points for incomplete rigor). Statistical analysis used one-way ANOVA with Tukey Honestly Significant Difference (HSD) post hoc testing, and intraclass correlation coefficients (ICC; 2-way random model) for interrater reliability. Qualitative thematic analysis identified recurrent errors and limitations.
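
The two quantitative steps named above can be sketched in plain Python. This is a minimal illustration, not the authors' analysis code. The abstract states only that a 2-way random model was used, so the single-rater, absolute-agreement form ICC(2,1) is an assumption here, and the function names and the score matrix are hypothetical. The sketch computes only the ANOVA F statistic; P values and Tukey HSD comparisons need the F and studentized range distributions, which a statistics package would normally supply.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per rated response and one column per rater.
    ASSUMPTION: the abstract does not say which 2-way random form was used.
    """
    n = len(ratings)       # number of rated responses
    k = len(ratings[0])    # number of raters
    grand = _mean(x for row in ratings for x in row)
    row_means = [_mean(row) for row in ratings]
    col_means = [_mean(row[j] for row in ratings) for j in range(k)]

    # Partition the total sum of squares into responses, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` holds one list of scores per AI system.
    """
    total_n = sum(len(g) for g in groups)
    grand = _mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (_mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - _mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # HYPOTHETICAL scores: 4 responses (rows) rated by 3 pharmacists (columns).
    conciseness = [[8, 7, 9], [6, 6, 7], [9, 8, 9], [5, 7, 6]]
    print("ICC(2,1):", round(icc2_1(conciseness), 3))

    # HYPOTHETICAL composite scores for 3 systems, 4 questions each.
    by_system = [[9, 10, 9, 9], [8, 7, 8, 9], [6, 7, 7, 6]]
    print("F, df_between, df_within:", one_way_anova_f(by_system))
```

An ICC near 1 means the pharmacists ranked and scored responses almost identically. A large F statistic means the variation between systems is large relative to the variation within each system's scores, which is what a post hoc test such as Tukey HSD then breaks down pair by pair.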

RESULTS

DeepSeek-R1 (DeepSeek) achieved the highest overall performance (mean composite score: medication consultation 9.4, SD 1.0; case analysis 9.3, SD 1.0), significantly outperforming the other systems in complex tasks (P<.05). Critical limitations were observed across models. High-risk decision errors: 75% omitted critical contraindications (eg, ethambutol in optic neuritis). Lack of localization: 90% erroneously recommended macrolides for drug-resistant Mycoplasma pneumoniae (China's high-resistance setting), while only DeepSeek-R1 aligned with updated American Academy of Pediatrics (AAP) guidelines for pediatric doxycycline. Complex reasoning deficits: only Claude-3.5-Sonnet detected a gender-diagnosis contradiction (prostatic hyperplasia in a female patient), and no model identified diazepam's 7-day prescription limit. Interrater consistency was lowest for conciseness in case analysis (ICC=0.70), reflecting evaluator disagreement on complex outputs. ERNIE Bot (Baidu) consistently underperformed (case analysis: 6.8, SD 1.5; P<.001 vs DeepSeek-R1).

CONCLUSIONS

While generative AI shows promise as a pharmacist assistance tool, significant limitations, including high-risk errors (eg, contraindication omissions), inadequate localization, and complex reasoning gaps, preclude autonomous clinical decision-making. Performance stratification highlights DeepSeek-R1's current advantage, but all systems require optimization in dynamic knowledge updating, complex scenario reasoning, and output interpretability. Future deployment must prioritize human oversight (human-AI co-review), ethical safeguards, and continuous evaluation frameworks.
