
Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI).

Authors

Lechien Jerome R, Maniaci Antonino, Gengler Isabelle, Hans Stephane, Chiesa-Estomba Carlos M, Vaira Luigi A

Affiliations

Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France.

Young Confederation of the European Oto-Rhino-Laryngological Head and Neck Surgery Societies (Y-CEORLHNS), Dublin, Ireland.

Publication information

Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2063-2079. doi: 10.1007/s00405-023-08219-y. Epub 2023 Sep 12.

DOI:10.1007/s00405-023-08219-y
PMID:37698703
Abstract

OBJECTIVES

To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI).

METHODS

Medical records of patients consulting in otolaryngology were evaluated by physicians and ChatGPT for differential diagnosis, management, and treatment. The ChatGPT performance was rated twice using AIPI within a 7-day period to assess test-retest reliability. Internal consistency was evaluated using Cronbach's α. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinical Assessment Tool (OCAT). Interrater reliability was assessed using Kendall's tau.
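As an illustration of the internal-consistency step, the sketch below computes Cronbach's α from a small matrix of item scores using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). The function name `cronbach_alpha` and the toy data are assumptions made for this example; they are not the AIPI items or the study data, and the abstract does not describe the software the authors used.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of ratings.

    scores: list of rows, one row per rated case, one column per item.
    Uses sample variances (n - 1 denominator) throughout.
    """
    if len(scores) < 2:
        raise ValueError("need at least two rated cases")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in scores):
        raise ValueError("every case must have a score for every item")

    # Variance of each item (column) across cases.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the per-case total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: 3 cases scored on 2 items.
toy_scores = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(toy_scores)
```

Values closer to 1 indicate that the items vary together; the abstract reports α = 0.754 for AIPI.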

RESULTS

Forty-five patients completed the evaluations (28 females). The AIPI Cronbach's alpha analysis suggested an adequate internal consistency (α = 0.754). The test-retest reliability was moderate-to-strong for items and the total score of AIPI (r = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher compared to the score of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity reported a moderate and significant correlation between AIPI and modified OCAT (r = 0.319; p = 0.044). The interrater reliability reported significant positive concordance between both otolaryngologists for the patient feature, diagnostic, additional examination, and treatment subscores as well as for the AIPI total score.
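The interrater concordance reported above rests on Kendall's tau, which compares how two raters order the same cases. The following is a minimal sketch of the tau-b variant (which corrects for tied scores). The abstract does not state which variant the authors used, so tau-b is an assumption here, as are the function name `kendall_tau_b` and the toy ratings.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both raters: contributes to no term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both raters order the pair the same way
        else:
            discordant += 1  # the raters order the pair in opposite ways

    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        raise ValueError("tau-b is undefined when a rater gives constant scores")
    return (concordant - discordant) / denominator


# Hypothetical total scores given by two raters to the same four cases.
rater_a = [12, 15, 9, 18]
rater_b = [11, 16, 10, 17]
tau = kendall_tau_b(rater_a, rater_b)
```

A value of 1 means the two raters rank every pair of cases in the same order, and −1 means they rank every pair in opposite order.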

CONCLUSIONS

AIPI is a valid and reliable instrument in assessing the performance of ChatGPT in ear, nose and throat conditions. Future studies are needed to investigate the usefulness of AIPI in medicine and surgery, and to evaluate the psychometric properties in these fields.
