Kim Kyuhyung, Mun Sae Byeol, Kim Young Jae, Kim Bong Chul, Kim Kwang Gi
Department of Oral and Maxillofacial Surgery, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon, Republic of Korea.
Department of Health Sciences and Technology, GAIHST, Gachon University, Incheon, Republic of Korea.
PLoS One. 2025 May 28;20(5):e0322529. doi: 10.1371/journal.pone.0322529. eCollection 2025.
In this study, we aimed to evaluate the ability of large language models (LLMs) to generate and answer questions in oral and maxillofacial surgery.
ChatGPT4, ChatGPT4o, and Claude3-Opus were evaluated in this study. Each LLM was instructed to generate 50 questions on oral and maxillofacial surgery, and all three LLMs were then asked to answer the resulting 150 questions.
All 150 questions generated by the three LLMs were related to oral and maxillofacial surgery. Each model achieved a correct answer rate above 90%, yet none of the three models correctly answered every question it had generated itself. The correct answer rate was 97.0% for questions with figures, significantly higher than the 88.9% rate for questions without figures. Analysis of the models' problem-solving showed that each generally inferred answers with high accuracy, with few logical errors that could be considered controversial. Additionally, all three models scored above 88% for the fidelity of their explanations.
This study demonstrates that while LLMs such as ChatGPT4, ChatGPT4o, and Claude3-Opus exhibit robust capabilities in generating and solving oral and maxillofacial surgery questions, their performance is not without limitations. None of the models correctly answered every question it had generated itself, highlighting persistent challenges such as AI hallucinations and gaps in contextual understanding. The results also underscore the importance of multimodal inputs, as questions with annotated images achieved higher accuracy rates than text-only prompts. Despite these shortcomings, the LLMs showed considerable promise in problem-solving, logical consistency, and response fidelity, particularly in structured or numerical contexts.