[Preliminary exploration of the applications of five large language models in the field of oral auxiliary diagnosis, treatment and health consultation].

Authors

Han C L, Bai S Z, Zhang T M, Liu C, Liu Y C, Hu X X, Zhao Y M

Affiliations

Digital Center, School of Stomatology, The Fourth Military Medical University, State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, National Clinical Research Center for Oral Diseases, Shaanxi Key Laboratory of Stomatology, Xi'an 710032, China. Hu Xiangxiang works at the Department of Orthodontics, The Ohio State University, Columbus 43210, USA.

Publication information

Zhonghua Kou Qiang Yi Xue Za Zhi. 2025 Jul 30;60(8):871-878. doi: 10.3760/cma.j.cn112144-20241107-00418.


DOI:10.3760/cma.j.cn112144-20241107-00418
PMID:40734399
Abstract

To evaluate the accuracy of the oral healthcare information provided by different large language models (LLMs) and to explore their feasibility and limitations in oral auxiliary diagnosis, treatment and health consultation. This study designed eight items comprising 47 questions related to the diagnosis and treatment of oral diseases [to assess the performance of LLMs as an artificial intelligence (AI) medical assistant], and five items comprising 35 questions about oral health consultation (to assess the performance of LLMs as a simulated doctor). Each question was answered separately by five LLMs (ERNIE Bot, HuatuoGPT, Tongyi Qianwen, iFlytek Spark, ChatGPT). Two attending physicians with more than 5 years of experience independently rated the responses using the 3C criteria (correct, clear, concise). Inter-rater consistency was assessed with the Spearman rank correlation coefficient, and differences between the models were tested with the Kruskal-Wallis test followed by the Dunn post hoc test. Additionally, 600 questions from the 2023 dental licensing examination were used to evaluate each model's answering time, score and accuracy. As an AI medical assistant, LLMs can assist doctors in diagnosis and treatment decision-making, with an inter-rater Spearman coefficient of 0.505 (P<0.01). As a simulated doctor, LLMs can provide patient education, with an inter-rater Spearman coefficient of 0.533 (P<0.01).
The 3C scores are reported as median (lower quartile, upper quartile). The scores of each model as an AI medical assistant and as a simulated doctor were, respectively: 2.00 (1.00, 3.00) and 2.00 (1.00, 3.00) for ERNIE Bot, 1.00 (1.00, 2.00) and 2.00 (1.00, 2.00) for HuatuoGPT, 2.00 (1.00, 2.00) and 2.00 (1.00, 3.00) for Tongyi Qianwen, 2.00 (1.00, 2.00) and 2.00 (1.75, 2.25) for iFlytek Spark, and 3.00 (2.00, 3.00) and 3.00 (2.00, 3.00) for ChatGPT (full score of 4 points). The Kruskal-Wallis test showed statistically significant differences in the 3C scores among the five LLMs in both roles (all P<0.001). On the dental licensing examination, the five LLMs averaged 370.2 points, an accuracy of 61.7% (370.2/600), with a mean answering time of 94.6 minutes. Specifically, ERNIE Bot took 115 minutes and scored 363 points, an accuracy of 60.5% (363/600); HuatuoGPT took 224 minutes and scored 305 points, an accuracy of 50.8% (305/600); Tongyi Qianwen took 43 minutes and scored 438 points, an accuracy of 73.0% (438/600); iFlytek Spark took 32 minutes and scored 364 points, an accuracy of 60.7% (364/600); and ChatGPT took 59 minutes and scored 381 points, an accuracy of 63.5% (381/600). Across the two roles of AI medical assistant and simulated doctor, ChatGPT performed best, with answers that were basically correct, clear and concise, followed by ERNIE Bot, Tongyi Qianwen and iFlytek Spark, with HuatuoGPT lagging significantly behind. In the dental licensing examination, all LLMs except HuatuoGPT reached the passing level, and all five models answered in far less time than the 8 h allowed under the examination regulations.
LLMs are feasible for use in oral auxiliary diagnosis, treatment and health consultation, and can help both doctors and patients obtain medical information quickly. However, their outputs carry a risk of error (none of the 3C scores reached full marks), so prudent judgment should be exercised when using them.
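
The examination figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not the authors' analysis code; it simply recomputes the mean time, mean score and per-model accuracy from the times and scores quoted above. The function names and the 60% pass mark are assumptions made for illustration, since the abstract does not state the passing threshold.

```python
# Answering time (minutes) and score (out of 600) as quoted in the abstract.
reported = {
    "ERNIE Bot": (115, 363),
    "HuatuoGPT": (224, 305),
    "Tongyi Qianwen": (43, 438),
    "iFlytek Spark": (32, 364),
    "ChatGPT": (59, 381),
}

TOTAL_QUESTIONS = 600
PASS_FRACTION = 0.60  # assumption: the abstract does not state the pass mark


def accuracy(score, total=TOTAL_QUESTIONS):
    """Return the fraction of questions answered correctly."""
    return score / total


def summarize(results):
    """Return (mean time, mean score, mean accuracy) across all models."""
    n = len(results)
    mean_time = sum(time for time, _ in results.values()) / n
    mean_score = sum(score for _, score in results.values()) / n
    return mean_time, mean_score, accuracy(mean_score)


def passing_models(results, threshold=PASS_FRACTION):
    """Return the names of models whose accuracy meets the assumed pass mark."""
    return [name for name, (_, score) in results.items()
            if accuracy(score) >= threshold]


if __name__ == "__main__":
    mean_time, mean_score, mean_acc = summarize(reported)
    print(f"mean time {mean_time:.1f} min, mean score {mean_score:.1f}, "
          f"mean accuracy {mean_acc:.1%}")
    for name, (_, score) in reported.items():
        print(f"{name}: {accuracy(score):.1%}")
    print("at or above assumed pass mark:", passing_models(reported))
```

Under the assumed 60% threshold, the check agrees with the abstract's statement that every model except HuatuoGPT passes. It also shows that a score of 438 corresponds to the stated 73.0% accuracy.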

