Abuabara Allan, do Nascimento Thais Vilalba Paniagua Machado, Trentini Seandra Maria, Costa Gonçalves Angela Mairane, Hueb de Menezes-Oliveira Maria Angélica, Madalena Isabela Ribeiro, Beisel-Memmert Svenja, Kirschneck Christian, Antunes Livia Azeredo Alves, Miranda de Araujo Cristiano, Baratto-Filho Flares, Küchler Erika Calvano
Post-Graduation Program in Health and Environment, University from the Joinville Region - Univille, Joinville, Brazil.
School of Dentistry, Tuiuti University of Paraná - UTP, Curitiba, Brazil.
Front Dent Med. 2025 Jul 29;6:1634006. doi: 10.3389/fdmed.2025.1634006. eCollection 2025.
Dental age estimation plays a key role in forensic identification, clinical diagnosis, treatment planning, and prognosis in fields such as pediatric dentistry and orthodontics. Large language models (LLMs) are increasingly recognized for their potential applications in dentistry. This study aimed to compare the performance of currently available generative artificial intelligence LLMs in estimating dental age from Demirjian's scores.
Panoramic radiographs were analyzed using Demirjian's method (1973), with each left permanent mandibular tooth classified from stage A to H. Three LLMs without task-specific training, ChatGPT (GPT-4-turbo), Gemini 2.0 Flash, and DeepSeek-V3, were tasked with estimating dental age based on the patient's Demirjian score for each tooth. Because these models are probabilistic and can give different responses to the same question, three responses per case were collected from each model on each of three separate days, using three different computers. The age estimates obtained from the LLMs were compared to the individuals' chronological ages. Intra- and inter-examiner reliability was assessed using the Intraclass Correlation Coefficient (ICC). Model performance was evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Coefficient of Determination (R²), and Bias.
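The four performance metrics named above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function name `evaluate`, the sign convention (estimated minus chronological age, so a positive bias means overestimation), and the toy ages are all assumptions made for the example.

```python
import math


def evaluate(chronological, estimated):
    """Return MAE, RMSE, R^2, and bias for paired ages in years.

    Hypothetical helper for illustration; errors are defined as
    estimated - chronological (an assumed sign convention).
    """
    if not chronological or len(chronological) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(chronological)
    errors = [e - c for c, e in zip(chronological, estimated)]

    mae = sum(abs(d) for d in errors) / n
    ss_res = sum(d * d for d in errors)
    rmse = math.sqrt(ss_res / n)
    bias = sum(errors) / n  # positive value = overestimation

    mean_age = sum(chronological) / n
    ss_tot = sum((c - mean_age) ** 2 for c in chronological)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all chronological ages are equal")
    # R^2 = 1 - SS_res / SS_tot; it is negative when the estimates are
    # worse than always predicting the mean chronological age.
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2, "bias": bias}


# Made-up toy data (not from the study): every estimate is 2 years too high.
ages = [8.0, 10.0, 12.0]
preds = [10.0, 12.0, 14.0]
metrics = evaluate(ages, preds)
print(metrics)
```

In the toy data, a constant two-year overestimate yields identical MAE, RMSE, and bias, and pushes R² below zero even though the estimates track the true ages perfectly in rank order.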
Thirty panoramic radiographs (40% female, 60% male; mean age 10.4 ± 2.32 years) were included. Both intra- and inter-examiner ICC values exceeded 0.85. ChatGPT and DeepSeek exhibited comparable but suboptimal performance, with higher errors (MAE: 1.98-2.05 years; RMSE: 2.33-2.35 years), negative R² values (-0.069 to -0.049), and substantial overestimation biases (1.90-1.91 years), indicating poor model fit and systematic flaws. Gemini demonstrated intermediate results, with a moderate MAE (1.57 years) and RMSE (1.81 years), a positive R² (0.367), and a lower bias (1.32 years).
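A rough arithmetic check shows how the reported figures relate to one another. It rests on assumptions the abstract does not state: that R² was computed as 1 - SS_res/SS_tot (the only common definition that can go negative), that the reported ± 2.32 years is a sample standard deviation of chronological age, and that each reported bias and RMSE describe the same set of estimates. The helper names are hypothetical, and because the inputs are rounded the results are only approximate.

```python
import math


def r2_from_rmse(rmse, age_sd, n):
    """Approximate R^2 from RMSE and the sample SD of chronological age.

    Assumes R^2 = 1 - MSE / Var(age), with Var(age) taken as the
    population variance recovered from a sample SD (n - 1 denominator).
    """
    var_pop = age_sd ** 2 * (n - 1) / n
    return 1.0 - rmse ** 2 / var_pop


def error_sd(rmse, bias):
    """Spread of the errors around the bias, using MSE = bias^2 + variance."""
    return math.sqrt(rmse ** 2 - bias ** 2)


N, AGE_SD = 30, 2.32  # sample size and age SD reported in the abstract

# RMSE above the age SD gives a negative R^2; RMSE below it gives a positive one.
print(round(r2_from_rmse(2.33, AGE_SD, N), 3))  # lower RMSE of the ChatGPT/DeepSeek range
print(round(r2_from_rmse(1.81, AGE_SD, N), 3))  # Gemini

# Share of the squared error explained by the systematic overestimate alone.
print(round(1.90 ** 2 / 2.33 ** 2, 2))
print(round(error_sd(2.33, 1.90), 2))
```

Under these assumptions the Gemini value lands close to the reported 0.367, and the values for the other two models come out slightly negative, in line with the reported signs. The last two lines suggest that most of the squared error at the 2.33-year RMSE comes from the consistent overestimate rather than from case-to-case scatter.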
This study demonstrated that, although LLMs such as ChatGPT, Gemini, and DeepSeek can estimate dental age using Demirjian's scores, their performance remains inferior to the traditional method. Among them, DeepSeek-V3 showed the best results, but all models require task-specific training and validation before clinical application.