Colapietro Francesca, Piovani Daniele, Pugliese Nicola, Aghemo Alessio, Ronca Vincenzo, Lleo Ana
Department of Biomedical Sciences, Humanitas University, Milan, Italy.
IRCCS Humanitas Research Hospital, Department of Gastroenterology, Division of Internal Medicine and Hepatology, Milan, Italy.
Am J Gastroenterol. 2025 Apr 1;120(4):914-919. doi: 10.14309/ajg.0000000000003179. Epub 2024 Oct 31.
Artificial intelligence-based chatbots offer a potential avenue for delivering personalized counseling to patients with autoimmune hepatitis. We assessed the accuracy, completeness, comprehensiveness, and safety of Chat Generative Pretrained Transformer-4 responses to 12 questions selected from a pool of 40 posed by 4 patients with autoimmune hepatitis.
Questions were categorized into 3 areas: diagnosis (1-3), quality of life (4-8), and medical treatment (9-12). Eleven key opinion leaders evaluated the responses using Likert scales with 6 points for accuracy, 5 points for safety, and 3 points each for completeness and comprehensiveness.
Median scores for accuracy, completeness, comprehensiveness, and safety were 5 (4-6), 2 (2-2), and 3 (2-3), respectively; no question domain was rated higher than the others. The postdiagnosis follow-up question was the most challenging: the response scored low on accuracy and completeness but was judged safe and comprehensive. Agreement among key opinion leaders (Fleiss kappa statistic) was slight for accuracy (0.05) and poor for the remaining features (-0.05, -0.06, and -0.02, respectively).
Chatbots show good comprehensibility but lack reliability. Further studies are needed before Chat Generative Pretrained Transformer can be integrated into clinical practice.
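The inter-rater agreement reported above relies on the Fleiss kappa statistic, which compares the observed agreement among a fixed number of raters with the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' analysis code: the function name `fleiss_kappa`, the count-matrix layout, and the toy ratings are all illustrative assumptions and do not reproduce the study data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a matrix of category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Per-item agreement: fraction of rater pairs that chose the same category.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        # All ratings fall in a single category; kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical ratings: 4 responses, 5 raters, a 3-point scale.
    toy_counts = [
        [0, 1, 4],
        [1, 3, 1],
        [0, 2, 3],
        [2, 2, 1],
    ]
    print(f"Fleiss kappa (toy data): {fleiss_kappa(toy_counts):.3f}")
```

A value of 1 indicates perfect agreement, a value near 0 indicates agreement no better than chance, and negative values, such as those reported for completeness, comprehensiveness, and safety, indicate agreement below the chance level.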