Pradhan P
15, Trauma Centre, District Hospital Neemuch Madhya Pradesh - 458441, India
Med Oral Patol Oral Cir Bucal. 2025 Mar 1;30(2):e224-e231. doi: 10.4317/medoral.26824.
The accurate and timely diagnosis of oral potentially malignant lesions (OPMLs) is crucial for effective management and prevention of oral cancer. Recent advancements in artificial intelligence technologies indicate their potential to assist in clinical decision-making. This study therefore aimed to evaluate and compare the diagnostic accuracy of ChatGPT 3.5, 4.0, 4o and Gemini in identifying OPMLs.
The analysis used 42 case reports from PubMed, Scopus and Google Scholar, together with images from two datasets corresponding to different OPMLs. The reports were entered separately into GPT 3.5, 4.0, 4o and Gemini for diagnosis based on text description, and into GPT 4o and Gemini for diagnosis based on image recognition. Two subject-matter experts independently reviewed the reports and offered their evaluations.
For text-based diagnosis, GPT 4o gave the most correct responses among the large language models (LLMs) (27/42), followed by GPT 4.0 (20/42), GPT 3.5 (18/42) and Gemini (15/42). In identifying OPMLs from images, GPT 4o performed better than Gemini. Agreement between the LLMs and the subject experts was fair to moderate. None of the LLMs matched the accuracy of the subject experts in the number of lesions correctly identified.
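To put the text-based counts on a common scale, the short sketch below converts the reported correct responses out of 42 into percentage accuracy. Only the counts come from the abstract; the variable and function names are illustrative assumptions, and no figures for the experts or for image-based diagnosis are included because the abstract does not report them.

```python
# Correct text-based diagnoses out of 42 case reports, as reported in the abstract.
correct_counts = {"GPT 4o": 27, "GPT 4.0": 20, "GPT 3.5": 18, "Gemini": 15}
TOTAL_CASES = 42


def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct responses as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


# Hypothetical helper structure: model name -> percentage accuracy.
accuracies = {
    model: accuracy_percent(count, TOTAL_CASES)
    for model, count in correct_counts.items()
}

for model, acc in accuracies.items():
    print(f"{model}: {correct_counts[model]}/{TOTAL_CASES} = {acc}%")
```

On these counts, only GPT 4o answered more than half of the cases correctly (about 64%), which is consistent with the cautious tone of the conclusion.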
The results point towards cautious optimism with respect to commonly used LLMs in diagnosing OPMLs. While their potential in diagnostic applications is undeniable, their integration should be approached judiciously.