Is ChatGPT-4 a Reliable Tool in Autoimmune Hepatitis?

Author information

Colapietro Francesca, Piovani Daniele, Pugliese Nicola, Aghemo Alessio, Ronca Vincenzo, Lleo Ana

Affiliations

Department of Biomedical Sciences, Humanitas University, Milan, Italy.

IRCCS Humanitas Research Hospital, Department of Gastroenterology, Division of Internal Medicine and Hepatology, Milan, Italy.

Publication information

Am J Gastroenterol. 2025 Apr 1;120(4):914-919. doi: 10.14309/ajg.0000000000003179. Epub 2024 Oct 31.

DOI:10.14309/ajg.0000000000003179

PMID:39480023

Abstract

INTRODUCTION

Artificial intelligence-based chatbots offer a potential avenue for delivering personalized counseling to patients with autoimmune hepatitis. We assessed accuracy, completeness, comprehensiveness, and safety of Chat Generative Pretrained Transformer-4 responses to 12 inquiries out of a pool of 40 questions posed by 4 patients with autoimmune hepatitis.

METHODS

Questions were categorized into 3 areas: diagnosis (questions 1-3), quality of life (questions 4-8), and medical treatment (questions 9-12). Eleven key opinion leaders evaluated the responses using Likert scales with 6 points for accuracy, 5 points for safety, and 3 points for completeness and comprehensiveness.

RESULTS

Median scores for accuracy, completeness, comprehensiveness, and safety were 5 (4-6), 2 (2-2), and 3 (2-3), respectively; no domain was rated better than the others. The postdiagnosis follow-up question was the trickiest: its response scored low on accuracy and completeness but was rated safe and comprehensive. Agreement among key opinion leaders (Fleiss kappa statistic) was slight for accuracy (0.05) and poor for the remaining features (-0.05, -0.06, and -0.02, respectively).
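For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how a Fleiss kappa can be computed from a table of rating counts. It is not the authors' analysis code: the function name `fleiss_kappa`, the input layout, and the `hypothetical_counts` data are assumptions made purely for illustration, and no values from the study are reproduced.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes every item was rated by the same number of raters.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    for row in counts:
        if len(row) != n_categories or sum(row) != n_raters:
            raise ValueError("every item needs the same categories and rater count")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / pairs for row in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data (NOT from the study): 3 responses, 11 raters, 3-point scale.
hypothetical_counts = [
    [1, 9, 1],
    [0, 10, 1],
    [2, 8, 1],
]

if __name__ == "__main__":
    print(fleiss_kappa(hypothetical_counts))
```

With the hypothetical counts above, most raters choose the same middle score for every response, yet the kappa comes out slightly negative, because the agreement expected by chance is then also high. This is one way that near-zero or negative values can arise even when raw scores look similar across raters.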

DISCUSSION

Chatbots show good comprehensibility but lack reliability. Further studies are needed to integrate Chat Generative Pretrained Transformer into clinical practice.
