Rewthamrongsris Paak, Burapacheep Jirayu, Phattarataratip Ekarat, Kulthanaamondhita Promphakkon, Tichy Antonin, Schwendicke Falk, Osathanon Thanaphum, Sappayatosok Kraisorn
Center of Artificial Intelligence and Innovation (CAII) and Center of Excellence for Dental Stem Cell Biology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand; Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Germany.
Department of Computer Science, Stanford University, Stanford, California, USA.
Int Dent J. 2025 Jun 6;75(4):100848. doi: 10.1016/j.identj.2025.100848.
Oral lichen planus (OLP) is a chronic inflammatory condition of the oral mucosa whose clinical features overlap with those of other oral lesions, presenting diagnostic challenges. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural networks (CNNs) constitute alternative diagnostic modalities. We evaluated the ability of seven LLMs, both proprietary and open source, to detect OLP in intraoral images and to generate differential diagnoses.
A dataset of 1,142 clinical photographs of histopathologically confirmed OLP, non-OLP lesions, and normal mucosa was used. The LLMs were tested under three experimental designs: zero-shot recognition, example-guided recognition, and differential diagnosis. Performance was measured using accuracy, precision, recall, F1 score, and discounted cumulative gain (DCG). In addition, the LLMs were compared with three previously published CNN-based OLP detection models on a subset of 110 photographs that had been used to test those CNNs.
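Of the reported metrics, DCG is the least standard: it rewards a correct diagnosis appearing higher in a ranked differential list. The following is a minimal Python sketch assuming binary relevance (1 if a listed diagnosis matches the histopathological diagnosis, 0 otherwise); the study's exact relevance grading is not specified here, so the helper name `dcg` and the example diagnoses are illustrative only.

```python
import math

def dcg(ranked_diagnoses: list[str], correct_diagnosis: str) -> float:
    """Discounted cumulative gain for one ranked differential list.

    With binary relevance, DCG reduces to 1 / log2(rank + 1) at the
    position of the correct diagnosis, or 0 if it is absent.
    """
    return sum(
        (1.0 if dx == correct_diagnosis else 0.0) / math.log2(rank + 1)
        for rank, dx in enumerate(ranked_diagnoses, start=1)
    )

# Hypothetical example: correct diagnosis ranked second
# yields 1 / log2(3) ≈ 0.63
print(dcg(["leukoplakia", "oral lichen planus", "candidiasis"],
          "oral lichen planus"))
```

Under this binary-relevance assumption, a model that consistently places the correct diagnosis first scores 1.0 per case, while ranking it second scores roughly 0.63.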
Gemini 1.5 Pro and Flash demonstrated the highest accuracy (69.69%) in zero-shot recognition, whereas GPT-4o achieved the highest F1 score (76.10%). With example-guided prompts, which improved consistency and reduced refusal rates, Gemini 1.5 Flash achieved the highest accuracy (80.53%) and F1 score (84.54%); however, Claude 3.5 Sonnet achieved the highest DCG (0.63). Although the proprietary models generally excelled, the open-source Llama model showed notable strength in ranking relevant diagnoses despite only moderate detection performance. All LLMs were outperformed by the CNN models.
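To make the zero-shot versus example-guided designs concrete, the sketch below assembles OpenAI-style chat payloads for an image-bearing prompt. This is a hypothetical illustration, not the study's actual prompts: `build_messages`, `encode_image`, the system instruction, and the binary "OLP"/"non-OLP" answer format are all assumptions for demonstration.

```python
import base64
from pathlib import Path

def encode_image(path: str) -> str:
    """Base64-encode a clinical photograph for embedding in a prompt."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

def build_messages(image_path: str,
                   examples: list[tuple[str, str]] | None = None) -> list[dict]:
    """Assemble an OpenAI-style chat payload.

    examples=None corresponds to the zero-shot design; passing labelled
    (image_path, label) pairs turns it into the example-guided design.
    """
    messages = [{
        "role": "system",
        "content": ("You are a dental diagnostic assistant. "
                    "Answer only 'OLP' or 'non-OLP'."),
    }]
    # Example-guided design: interleave labelled example images
    # with their reference answers before the query image.
    for ex_path, ex_label in examples or []:
        messages.append({"role": "user", "content": [
            {"type": "text", "text": "Example lesion:"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image(ex_path)}"}},
        ]})
        messages.append({"role": "assistant", "content": ex_label})
    # The photograph to be classified.
    messages.append({"role": "user", "content": [
        {"type": "text", "text": "Classify this intraoral photograph."},
        {"type": "image_url", "image_url": {
            "url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
    ]})
    return messages
```

The same payload structure would be adapted per vendor API; the key design difference between the two conditions is simply whether labelled example images precede the query image.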
The seven LLMs evaluated do not yet perform well enough for clinical use, and CNNs trained to detect OLP outperformed all LLMs tested in this study.