Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health.

Author Information

Zhang Qian, Wu Zhengyu, Song Jinlin, Luo Shuicai, Chai Zhaowu

Affiliations

College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China.

Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, China.

Publication Information

Int Dent J. 2025 Feb;75(1):151-157. doi: 10.1016/j.identj.2024.06.022. Epub 2024 Aug 14.

Abstract

AIM

Given the increasing interest in using large language models (LLMs) for self-diagnosis, this study aimed to evaluate the comprehensiveness of two prominent LLMs, ChatGPT-3.5 and ChatGPT-4, in addressing common queries related to gingival and endodontic health across different language contexts and query types.

METHODS

We assembled a set of 33 common real-life questions related to gingival and endodontic healthcare, comprising 17 common-sense questions and 16 expert questions. Each question was presented to the LLMs in both English and Chinese. Three specialists were invited to rate the comprehensiveness of the responses on a five-point Likert scale, where higher scores indicated higher-quality responses.

RESULTS

The LLMs performed significantly better in English, with an average score of 4.53, than in Chinese, with an average of 3.95 (Mann-Whitney U test, P < .05). Responses to common-sense questions received higher scores than responses to expert questions, with averages of 4.46 and 4.02, respectively (Mann-Whitney U test, P < .05). Between the two LLMs, ChatGPT-4 consistently outperformed ChatGPT-3.5, with average scores of 4.45 and 4.03, respectively (Mann-Whitney U test, P < .05).
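
The group comparisons reported above rest on a standard rank-based test. The following is a minimal sketch, not the authors' actual analysis code, of how two-sided Mann-Whitney U tests could be applied to the specialists' Likert ratings; the DataFrame columns (score, language, question_type, model) and the example ratings are hypothetical placeholders.

    import pandas as pd
    from scipy.stats import mannwhitneyu

    def compare_groups(df, column, group_a, group_b):
        """Two-sided Mann-Whitney U test between two groups of Likert scores."""
        a = df.loc[df[column] == group_a, "score"]
        b = df.loc[df[column] == group_b, "score"]
        stat, p = mannwhitneyu(a, b, alternative="two-sided")
        print(f"{group_a} (mean {a.mean():.2f}) vs {group_b} (mean {b.mean():.2f}): "
              f"U = {stat:.1f}, P = {p:.4f}")

    # Illustrative ratings only; the study itself scored 33 questions in two
    # languages for two models, each rated by three specialists.
    ratings = pd.DataFrame({
        "score":         [5, 4, 5, 4, 3, 4, 4, 3],
        "language":      ["English"] * 4 + ["Chinese"] * 4,
        "question_type": ["common sense", "expert"] * 4,
        "model":         ["ChatGPT-4", "ChatGPT-3.5"] * 4,
    })

    compare_groups(ratings, "language", "English", "Chinese")
    compare_groups(ratings, "question_type", "common sense", "expert")
    compare_groups(ratings, "model", "ChatGPT-4", "ChatGPT-3.5")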

CONCLUSIONS

ChatGPT-4 provides more comprehensive responses than ChatGPT-3.5 for queries related to gingival and endodontic health. Both LLMs perform better in English and on common-sense questions. However, the performance discrepancies across language contexts and the presence of inaccurate responses indicate that further evaluation and a clearer understanding of their limitations are essential to avoid potential misunderstandings.

CLINICAL RELEVANCE

This study revealed the performance differences of ChatGPT-3.5 and ChatGPT-4 in handling gingival and endodontic health issues across different language contexts, providing insights into the comprehensiveness and limitations of LLMs in addressing common oral healthcare queries.

Figure (gr1): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1574/11806297/6460768c0095/gr1.jpg
