Zhan Zheng-Zhe, Xiong Yu-Tao, Wang Chen-Yuan, Zhang Bao-Tian, Lian Wen-Jun, Zeng Yu-Min, Liu Wei, Tang Wei, Liu Chang
State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, 610041, China.
Sci Rep. 2025 Feb 12;15(1):5187. doi: 10.1038/s41598-025-89328-y.
The aim of this study was to evaluate GPT-4's reasoning ability to interpret oral mucosal disease photos and generate structured reports from free-text inputs, while exploring the role of prompt engineering in enhancing its performance. A prompt, developed through automatic prompt engineering combined with the knowledge of oral physicians, was provided to GPT-4 to generate structured reports from cases of oral mucosal disease. The structured reports included 7 fine-grained items: "location", "shape", "number", "size", "clinical manifestation", "the border of the lesion" and "diagnosis". A total of 120 cases of oral mucosal disease were used for testing, divided into two datasets: a textbook dataset (n = 60) and an internet dataset (n = 60). Oral physicians evaluated GPT-4's responses using confusion matrices, from which recall and accuracy were calculated. ANOVA and Wald χ² tests with Bonferroni correction were used for statistical analysis. GPT-4 had higher recall on the textbook dataset than on the internet dataset (90.73% vs 89.12%; P = .462, χ² = 0.54) and higher accuracy (87.05% vs 84.87%; P = .393, χ² = 0.73), although neither difference was statistically significant. Performance varied by report item within each dataset, with "size" achieving the highest accuracy in the textbook dataset (98.90%) and "the border of the lesion" achieving the highest accuracy in the internet dataset (95.00%). GPT-4 can transform incomplete descriptive text corresponding to oral mucosal disease photographs into structured reports with the assistance of carefully designed prompts. This study highlights GPT-4's potential in complex, multimodal medical tasks and underscores the importance of prompt engineering in optimizing its capabilities. Nevertheless, further improvements in the model may require more comprehensive and focused efforts.
This article demonstrates the capabilities of large multimodal models, represented by GPT-4, in interpreting medical photographs and generating medical reports. GPT-4 was capable of recognizing photographs of oral mucosal diseases and generating structured reports, which facilitates telemedicine and peer-to-peer communication.
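The abstract reports recall and accuracy derived from confusion matrices but does not spell out how they were computed. The following is a minimal sketch using the standard definitions, recall = TP / (TP + FN) and accuracy = (TP + TN) / (TP + FP + FN + TN). The class name `ConfusionCounts` and the counts shown are hypothetical illustrations, not data or code from the study, and the way the authors assigned responses to each confusion-matrix cell is not stated in the abstract.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a 2x2 confusion matrix (hypothetical helper, not from the study)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def recall(self) -> float:
        # Share of actual positives that were identified: TP / (TP + FN)
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def accuracy(self) -> float:
        # Share of all judgments that were correct: (TP + TN) / total
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# Illustrative counts only; these are NOT results reported in the paper.
counts = ConfusionCounts(tp=45, fp=3, fn=5, tn=7)
print(f"recall={counts.recall():.2%}, accuracy={counts.accuracy():.2%}")
```

Computing the two metrics per report item (for example, once for "size" and once for "diagnosis") would yield the kind of item-level comparison the abstract describes.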