Seo Sujeong, Kim Kyuli, Yang Heyoung
Future Technology Analysis Center, Korea Institute of Science and Technology Information, Seoul, Republic of Korea.
Postal Savings & Insurance Development Institute, Seoul, Republic of Korea.
JMIR Med Inform. 2025 Feb 12;13:e64318. doi: 10.2196/64318.
The recent introduction of generative artificial intelligence (AI) as an interactive consultant has sparked interest in evaluating its applicability in medical discussions and consultations, particularly within the domain of depression.
This study evaluates the ability of large language models (LLMs) to generate responses to depression-related queries.
Using the PubMedQA and QuoraQA data sets, we compared various LLMs, including BioGPT, PMC-LLaMA, GPT-3.5, and Llama2, and measured the similarity between the generated and original answers.
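The comparison step described above, scoring each generated answer against its original answer, can be illustrated with a minimal sketch. The abstract does not say which similarity measure the study used, so the bag-of-words cosine similarity below is an assumption chosen for illustration, not the authors' method. The function names (`tokenize`, `cosine_similarity`, `mean_similarity`) and the sample answer pair are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(generated, reference):
    """Cosine similarity between the token-count vectors of two answers.

    Assumption: a simple lexical measure stands in for whatever metric
    the study actually used.
    """
    a = Counter(tokenize(generated))
    b = Counter(tokenize(reference))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty answer shares nothing with any reference
    return dot / (norm_a * norm_b)


def mean_similarity(pairs):
    """Average similarity over (generated, reference) answer pairs."""
    scores = [cosine_similarity(g, r) for g, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical answer pair, not taken from PubMedQA or QuoraQA.
    pairs = [
        ("Exercise may reduce symptoms of depression.",
         "Regular exercise can reduce depression symptoms."),
    ]
    print(round(mean_similarity(pairs), 3))
```

Under this sketch, each model would be scored by averaging `cosine_similarity` over all question-answer pairs in a data set, and the per-model means would then be compared. A lexical measure like this one rewards shared wording only, so an embedding-based measure would be needed to credit paraphrased answers.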
The latest general LLMs, GPT-3.5 and Llama2, exhibited superior performance, particularly in generating responses to medical inquiries from the PubMedQA data set.
Given the rapid advances in LLM development in recent years, we hypothesize that version upgrades of general LLMs offer greater potential for improving the generation of "knowledge text" in the biomedical domain than fine-tuning for the biomedical field does. These findings are expected to contribute significantly to the evolution of AI-based medical counseling systems.