Provenzano Gina, McCahon Joseph A S, Nghe Amy, Lencer Adam, Amponsah Nana, Daniel Joseph N, Pedowitz David I, Parekh Selene G
Thomas Jefferson University Hospital, Philadelphia, Pennsylvania.
Jefferson Health NJ, Stratford, New Jersey.
Foot Ankle Spec. 2025 Aug 20:19386400251363000. doi: 10.1177/19386400251363000.
Background
Billing and coding for orthopaedic procedures is a complex process, with thousands of procedure codes and associated modifiers in existence. Foot and ankle surgery faces an additional challenge, as it has among the highest variability in procedures performed compared with other orthopaedic subspecialties. This study aimed to investigate the capabilities of the top AI search engines in accurately identifying Current Procedural Terminology (CPT) codes for common foot and ankle procedures.
Methods
A comparative analysis of 3 publicly available AI search engines (ChatGPT, Bing, and Google Gemini) was performed to investigate their accuracy in generating CPT codes for common orthopaedic foot and ankle procedures. The generated CPT codes were recorded and compared with the codes generated by 3 fellowship-trained foot and ankle surgeons, which served as the reference standard. Cohen's kappa coefficient was used to determine agreement of the AI platforms with the surgeon coding reference standard.
Results
The AI search engines correctly generated the appropriate CPT codes 44% of the time. Bing was the most accurate, generating the correct CPT codes for 8 of the 13 procedures (62%) and partially correct codes for 3 of the 13 procedures (23%). ChatGPT demonstrated the worst accuracy, generating the correct CPT codes only 23% of the time (3/13). The AI platforms demonstrated overall fair agreement with the reference standard (kappa = 0.201). Individually, Bing demonstrated moderate agreement (kappa = 0.405), Google Gemini demonstrated fair agreement (kappa = 0.255), and ChatGPT demonstrated poor agreement with the reference standard (kappa = 0.171).
Conclusion
Although the capabilities of AI show great promise for many industries, the results of this study urge caution in relying on AI to accurately generate orthopaedic foot and ankle procedure CPT codes.
Level of Evidence:
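As a minimal illustration of the agreement statistic named in the Methods (not the authors' actual analysis or data), the sketch below computes Cohen's kappa between a hypothetical surgeon reference coding and hypothetical AI-generated CPT codes for 13 procedures; the specific codes, the 13-item list, and the use of scikit-learn are assumptions made for demonstration only.

```python
# Illustrative sketch only: hypothetical CPT code assignments, not study data.
# Assumes scikit-learn is installed; cohen_kappa_score measures agreement
# between two raters assigning nominal labels to the same items.
from sklearn.metrics import cohen_kappa_score

# Hypothetical surgeon reference codes for 13 procedures (assumed values)
surgeon_codes = ["27650", "28296", "28285", "27792", "28297", "28110", "27814",
                 "28730", "28705", "28300", "27870", "28120", "28090"]
# Hypothetical AI-generated codes for the same 13 procedures (assumed values)
ai_codes      = ["27650", "28296", "28280", "27792", "28292", "28110", "27814",
                 "28715", "28705", "28304", "27870", "28124", "28092"]

# Kappa corrects raw percent agreement for agreement expected by chance
kappa = cohen_kappa_score(surgeon_codes, ai_codes)
print(f"Cohen's kappa: {kappa:.3f}")
```

Under the common Landis and Koch interpretation, values near 0.2 indicate slight-to-fair agreement and values near 0.4 moderate agreement, which is the scale the reported kappa values appear to follow.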