Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

Affiliations

School of Dentistry, European University Cyprus, Nicosia, Cyprus.

Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece.

Publication information

J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.

DOI:10.2196/51580
PMID:38009003
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10784979/

Abstract

BACKGROUND

The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy.

OBJECTIVE

This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry.

METHODS

The LLMs were queried with 20 open-type, clinical dentistry-related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. The LLMs' answers were graded 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, using a rubric, as if they were examination questions posed to students, by 2 experienced faculty members. The scores were statistically compared to identify the best-performing model using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs' answers.
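
The scoring comparison described above can be sketched roughly as follows. This is a minimal illustration only, not the authors' analysis code: the score values are made-up placeholders (not the study's data), the function names `average_ranks` and `friedman_statistic` are hypothetical, and the statistic is computed without a tie correction. A p-value would come from a chi-square distribution with k-1 degrees of freedom; in practice a statistics library such as SciPy (`scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`) would likely be used instead.

```python
def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(score_table):
    """Friedman chi-square statistic (no tie correction).

    score_table: one row per question, one column per model.
    """
    n = len(score_table)
    k = len(score_table[0])
    rank_sums = [0.0] * k
    for row in score_table:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# PLACEHOLDER scores (0-10) from two hypothetical evaluators:
# rows are questions, columns are four models. Not the study's data.
evaluator_a = [[6, 7, 9, 5], [5, 6, 8, 7], [7, 6, 10, 8], [4, 5, 9, 6]]
evaluator_b = [[7, 7, 8, 5], [5, 7, 9, 6], [6, 6, 9, 8], [5, 5, 8, 6]]

# Average the two evaluators' scores per question and model, as the abstract describes.
averaged = [
    [(a + b) / 2 for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(evaluator_a, evaluator_b)
]

print("Friedman statistic:", friedman_statistic(averaged))
```

A significant Friedman result would then be followed by pairwise Wilcoxon signed-rank tests between models, which is how the abstract reports the ChatGPT-4 comparisons.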

RESULTS

Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate.

CONCLUSIONS

This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if not used judiciously. Therefore, these tools should not replace the dentist's critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as their imprudent use could potentially impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c070/10784979/abb20802d1d6/jmir_v25i1e51580_fig1.jpg
