AlShenaiber Abdullah, Datta Shaishav, Mosa Adam J, Binhammer Paul A, Ing Edsel B
Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
Division of Plastic, Reconstructive & Aesthetic Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada.
J Hand Surg Glob Online. 2024 Sep 3;6(6):847-854. doi: 10.1016/j.jhsg.2024.07.011. eCollection 2024 Nov.
Tools using artificial intelligence may help reduce missed or delayed diagnoses and improve patient care in hand surgery. This study aimed to compare and evaluate the performance of two natural language processing programs, Isabel and ChatGPT-4, in diagnosing hand and peripheral nerve injuries from a set of clinical vignettes.
Cases from a virtual library of hand surgery case reports with no history of trauma or previous surgery were included in this study. The clinical details (age, sex, symptoms, signs, and medical history) of 16 hand cases were entered into Isabel and ChatGPT-4 to generate top 10 differential diagnosis lists. The two programs were compared on how often the correct diagnosis appeared in each list and on its median rank within the list. Two hand surgeons were then given each list and asked to independently evaluate the performance of the two systems.
Isabel correctly identified 7/16 (44%) cases with a median rank of two (interquartile range = 3). ChatGPT-4 correctly identified 14/16 (88%) cases with a median rank of one (interquartile range = 1). Physicians one and two, respectively, preferred the lists generated by ChatGPT-4 in 12/16 (75%) and 13/16 (81%) of cases and had no preference in 2/16 (13%) cases.
ChatGPT-4 had significantly greater diagnostic accuracy within our sample (P < .05) and generated higher quality differential diagnoses than Isabel. Isabel produced several inappropriate and imprecise differential diagnoses.
Despite the potential utility of large language models in generating medical diagnoses, physicians must continue to exercise great caution and use their clinical judgment when making diagnostic decisions.
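The abstract reports the accuracy difference as significant but does not name the statistical test used. As a rough illustration only, the sketch below runs a two-sided Fisher exact test on the reported counts (7/16 correct for Isabel, 14/16 for ChatGPT-4). The choice of test and the function name `fisher_exact_two_sided` are assumptions made here, not the authors' method. Because both programs saw the same 16 cases, a paired test such as McNemar's may be more appropriate, but the abstract does not give the case-level agreement that such a test needs.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed table. Illustrative helper only.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)

    def weight(k: int) -> int:
        # Unnormalized hypergeometric weight, kept as an exact integer
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(row1 + row2, col1)


# Counts taken from the abstract: correct vs. incorrect out of 16 cases
isabel_correct, chatgpt_correct, n_cases = 7, 14, 16

p_value = fisher_exact_two_sided(
    isabel_correct, n_cases - isabel_correct,
    chatgpt_correct, n_cases - chatgpt_correct,
)
print(f"Isabel accuracy:    {isabel_correct / n_cases:.4f}")
print(f"ChatGPT-4 accuracy: {chatgpt_correct / n_cases:.4f}")
print(f"Fisher exact two-sided p-value: {p_value:.4f}")
```

Under these assumptions the p-value falls below .05, which is consistent with the reported result, though it does not confirm which test the authors used.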