Lechien Jerome R, Maniaci Antonino, Gengler Isabelle, Hans Stephane, Chiesa-Estomba Carlos M, Vaira Luigi A
Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France.
Young Confederation of the European Oto-Rhino-Laryngological Head and Neck Surgery Societies (Y-CEORLHNS), Dublin, Ireland.
Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2063-2079. doi: 10.1007/s00405-023-08219-y. Epub 2023 Sep 12.
To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI).
Medical records of patients seen in otolaryngology consultation were evaluated by physicians and ChatGPT for differential diagnosis, management, and treatment. ChatGPT's performance was rated twice with the AIPI within a 7-day period to assess test-retest reliability. Internal consistency was evaluated using Cronbach's α. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinical Assessment Tool (OCAT). Interrater reliability was assessed using Kendall's tau.
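As an illustration of the internal consistency measure named above, the following is a minimal sketch of the standard Cronbach's α formula, α = k/(k−1) · (1 − Σ item variances / variance of total scores), using sample variances. The function name `cronbach_alpha` and the example scores are hypothetical; the abstract does not report item-level AIPI data or the software used, so this is not the authors' analysis.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rated cases.

    scores: list of rows, one row per rated case; each row holds the
    k item scores for that case (hypothetical layout, not from the study).
    """
    if len(scores) < 2:
        raise ValueError("need at least two rated cases")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across cases.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the per-case total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 4 cases, 3 items each (illustrative only).
example = [
    [2, 3, 3],
    [1, 1, 2],
    [3, 3, 2],
    [0, 1, 1],
]
alpha = cronbach_alpha(example)
```

Values of α around 0.7 or higher are conventionally read as acceptable internal consistency, which is the interpretation applied to the reported α = 0.754.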
Forty-five patients completed the evaluations (28 females). Cronbach's α analysis suggested adequate internal consistency for the AIPI (α = 0.754). The test-retest reliability was moderate-to-strong for the items and the total score of the AIPI (r = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher than that of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity analysis showed a moderate and significant correlation between the AIPI and the modified OCAT (r = 0.319; p = 0.044). Interrater reliability analysis showed significant positive concordance between the two otolaryngologists for the patient feature, diagnostic, additional examination, and treatment subscores, as well as for the AIPI total score.
The AIPI is a valid and reliable instrument for assessing the performance of ChatGPT in ear, nose and throat conditions. Future studies are needed to investigate the usefulness of the AIPI in medicine and surgery, and to evaluate its psychometric properties in these fields.