Performance Assessment of Large Language Models in Medical Consultation: Comparative Study.

Author information

Seo Sujeong, Kim Kyuli, Yang Heyoung

Affiliations

Future Technology Analysis Center, Korea Institute of Science and Technology Information, Seoul, Republic of Korea.

Postal Savings & Insurance Development Institute, Seoul, Republic of Korea.

Publication information

JMIR Med Inform. 2025 Feb 12;13:e64318. doi: 10.2196/64318.

DOI:10.2196/64318

PMID:39763114

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11888074/

Abstract

BACKGROUND

The recent introduction of generative artificial intelligence (AI) as an interactive consultant has sparked interest in evaluating its applicability in medical discussions and consultations, particularly within the domain of depression.

OBJECTIVE

This study evaluates the capability of large language models (LLMs) in AI to generate responses to depression-related queries.

METHODS

Using the PubMedQA and QuoraQA data sets, we compared various LLMs, including BioGPT, PMC-LLaMA, GPT-3.5, and Llama2, and measured the similarity between the generated and original answers.
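The abstract does not name the similarity measure used, so the following is only a minimal sketch of the general idea: score each generated answer against its original answer and average the scores per model. Bag-of-words cosine similarity is an assumption chosen for illustration, and the function names (`tokenize`, `cosine_similarity`, `mean_similarity`) and the example texts are hypothetical rather than taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the word-count vectors of two texts.

    Assumption: the study's actual metric is not stated in the abstract;
    this lexical measure only illustrates the comparison step.
    """
    gen_counts = Counter(tokenize(generated))
    ref_counts = Counter(tokenize(reference))
    shared = gen_counts.keys() & ref_counts.keys()
    dot = sum(gen_counts[t] * ref_counts[t] for t in shared)
    gen_norm = math.sqrt(sum(c * c for c in gen_counts.values()))
    ref_norm = math.sqrt(sum(c * c for c in ref_counts.values()))
    norm = gen_norm * ref_norm
    # Empty texts have zero norm; treat them as having no similarity.
    return dot / norm if norm else 0.0


def mean_similarity(pairs: list[tuple[str, str]]) -> float:
    """Average similarity over (generated, original) answer pairs."""
    scores = [cosine_similarity(gen, ref) for gen, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical answer pair, not drawn from PubMedQA or QuoraQA.
    example_pairs = [
        (
            "Regular exercise may reduce symptoms of depression.",
            "Exercise is associated with reduced depression symptoms.",
        ),
    ]
    print(f"mean similarity: {mean_similarity(example_pairs):.3f}")
```

A lexical measure like this rewards word overlap only; an embedding-based measure would be needed to credit paraphrased answers, which is one reason the choice of metric matters when comparing models.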

RESULTS

The latest general LLMs, GPT-3.5 and Llama2, exhibited superior performance, particularly in generating responses to medical inquiries from the PubMedQA data set.

CONCLUSIONS

Considering the rapid advancements in LLM development in recent years, it is hypothesized that version upgrades of general LLMs offer greater potential for enhancing their ability to generate "knowledge text" in the biomedical domain compared with fine-tuning for the biomedical field. These findings are expected to contribute significantly to the evolution of AI-based medical counseling systems.
