Performance of Multimodal Artificial Intelligence Chatbots Evaluated on Clinical Oncology Cases.

Author information

Radiation Medicine Program, Princess Margaret Hospital Cancer Centre, Toronto, Ontario, Canada.

Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.

Publication information

JAMA Netw Open. 2024 Oct 1;7(10):e2437711. doi: 10.1001/jamanetworkopen.2024.37711.

Abstract

IMPORTANCE

Multimodal artificial intelligence (AI) chatbots can process complex medical images alongside text-based information, which may make them more accurate clinical diagnostic and management tools than unimodal, text-only AI chatbots. However, whether multimodal and text-only chatbots differ in medical accuracy when addressing questions about clinical oncology cases remains to be tested.

OBJECTIVE

To evaluate the utility of prompt engineering (zero-shot chain-of-thought) and compare the competency of multimodal and unimodal AI chatbots to generate medically accurate responses to questions about clinical oncology cases.
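As a rough illustration of what zero-shot chain-of-thought prompting usually means, the sketch below appends a generic reasoning cue to a case question without supplying any worked examples. The abstract does not give the prompts used in the study, so the function name `build_prompt`, the cue wording, and the sample case are all assumptions made for illustration only.

```python
from typing import List, Optional

# Commonly used zero-shot chain-of-thought cue. This wording is an assumption;
# the abstract does not state the exact prompt used in the study.
COT_CUE = "Let's think step by step."


def build_prompt(case_text: str, question: str,
                 options: Optional[List[str]] = None,
                 chain_of_thought: bool = True) -> str:
    """Assemble one prompt for a single clinical case (hypothetical helper)."""
    parts = [case_text.strip(), question.strip()]
    if options:
        # Multiple-choice format: label the options A, B, C, ...
        for index, option in enumerate(options):
            parts.append(f"{chr(ord('A') + index)}. {option}")
    if chain_of_thought:
        # Zero-shot: no solved examples are included, only the reasoning cue.
        parts.append(COT_CUE)
    return "\n".join(parts)


# Invented placeholder case, not taken from the study.
example = build_prompt(
    "A patient presents with a pigmented lesion on the forearm.",
    "What is the most likely diagnosis?",
    options=["Diagnosis one", "Diagnosis two"],
)
print(example)
```

Passing `options=None` gives the free-text variant of the same question, and `chain_of_thought=False` gives a plain prompt to compare against.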

DESIGN, SETTING, AND PARTICIPANTS

This cross-sectional study benchmarked the medical accuracy of multiple-choice and free-text responses generated by AI chatbots in response to 79 questions about clinical oncology cases with images.

EXPOSURES

A unique set of 79 clinical oncology cases from JAMA Network Learning accessed on April 2, 2024, was posed to 10 AI chatbots.

MAIN OUTCOMES AND MEASURES

The primary outcome was medical accuracy evaluated by the number of correct responses by each AI chatbot. Multiple-choice responses were marked as correct based on the ground-truth, correct answer. Free-text responses were rated by a team of oncology specialists in duplicate and marked as correct based on consensus or resolved by a review of a third oncology specialist.
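The free-text rating rule described above (two independent ratings, with a third specialist resolving disagreements) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `adjudicate` and `count_correct` and the boolean encoding of ratings are hypothetical, and the study's actual rating workflow may differ in detail.

```python
from typing import Iterable, Optional, Tuple


def adjudicate(rater_a: bool, rater_b: bool,
               rater_c: Optional[bool] = None) -> bool:
    """Return the final correct/incorrect label for one free-text response.

    rater_a and rater_b are the duplicate ratings; rater_c is the third
    specialist's rating and is only consulted when the first two disagree.
    """
    if rater_a == rater_b:
        return rater_a  # consensus between the two raters
    if rater_c is None:
        raise ValueError("raters disagree; a third rating is required")
    return rater_c  # disagreement resolved by the third reviewer


def count_correct(ratings: Iterable[Tuple[bool, bool, Optional[bool]]]) -> int:
    """Count responses whose adjudicated label is 'correct'."""
    return sum(adjudicate(a, b, c) for a, b, c in ratings)


# Invented ratings for three responses, for illustration only.
sample = [(True, True, None), (True, False, True), (False, False, None)]
print(count_correct(sample))
```

The count returned by `count_correct` plays the role of the primary outcome for one chatbot: the number of responses judged correct.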

RESULTS

This study evaluated 10 chatbots, including 3 multimodal and 7 unimodal chatbots. On the multiple-choice evaluation, the top-performing chatbot was chatbot 10 (57 of 79 [72.15%]), followed by the multimodal chatbot 2 (56 of 79 [70.89%]) and chatbot 5 (54 of 79 [68.35%]). On the free-text evaluation, the top-performing chatbots were chatbot 5, chatbot 7, and the multimodal chatbot 2 (30 of 79 [37.97%]), followed by chatbot 10 (29 of 79 [36.71%]) and chatbot 8 and the multimodal chatbot 3 (25 of 79 [31.65%]). The accuracy of multimodal chatbots decreased when tested on cases with multiple images compared with questions with single images. Nine out of 10 chatbots, including all 3 multimodal chatbots, demonstrated decreased accuracy of their free-text responses compared with multiple-choice responses to questions about cancer cases.
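The percentages reported above follow directly from the counts of correct responses out of 79 questions. The short sketch below recomputes them from the counts given in the abstract; the dictionary layout and the helper name `accuracy_percent` are illustrative choices, and only chatbots whose counts appear in the abstract are included.

```python
TOTAL_QUESTIONS = 79  # number of clinical oncology questions in the study

# Counts of correct responses as reported in the abstract.
multiple_choice_correct = {
    "chatbot 10": 57,
    "chatbot 2 (multimodal)": 56,
    "chatbot 5": 54,
}
free_text_correct = {
    "chatbot 5": 30,
    "chatbot 7": 30,
    "chatbot 2 (multimodal)": 30,
    "chatbot 10": 29,
    "chatbot 8": 25,
    "chatbot 3 (multimodal)": 25,
}


def accuracy_percent(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / total, 2)


for name, correct in multiple_choice_correct.items():
    print(f"multiple choice, {name}: {accuracy_percent(correct)}%")
for name, correct in free_text_correct.items():
    print(f"free text, {name}: {accuracy_percent(correct)}%")

# Drop from multiple-choice to free-text for chatbots listed in both formats.
for name in multiple_choice_correct:
    if name in free_text_correct:
        drop = multiple_choice_correct[name] - free_text_correct[name]
        print(f"{name}: {drop} fewer correct free-text responses")
```

For the three chatbots that appear in both lists, the free-text count is 24 to 28 responses lower than the multiple-choice count, which matches the direction of the decrease reported for 9 of the 10 chatbots.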

CONCLUSIONS AND RELEVANCE

In this cross-sectional study of chatbot accuracy tested on clinical oncology cases, multimodal chatbots were not consistently more accurate than unimodal chatbots. These results suggest that further research is required to optimize multimodal chatbots to make more use of information from images to improve oncology-specific medical accuracy and reliability.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9a2/11581577/90d7d617ccce/jamanetwopen-e2437711-g001.jpg
