
Doctor Versus Artificial Intelligence: Patient and Physician Evaluation of Large Language Model Responses to Rheumatology Patient Questions in a Cross-Sectional Study.

Affiliations

University of Alberta, Edmonton, Alberta, Canada.

University Hospital Düsseldorf, Düsseldorf, Germany.

Publication Information

Arthritis Rheumatol. 2024 Mar;76(3):479-484. doi: 10.1002/art.42737. Epub 2024 Jan 18.

Abstract

OBJECTIVE

The objective of the current study was to assess the quality of large language model (LLM) chatbot-generated versus physician-generated responses to patient-generated rheumatology questions.

METHODS

We conducted a single-center cross-sectional survey of rheumatology patients (n = 17) in Edmonton, Alberta, Canada. Patients evaluated LLM chatbot-generated versus physician-generated responses for comprehensiveness and readability, and four rheumatologists evaluated the same responses for comprehensiveness, readability, and accuracy. All ratings used a Likert scale from 1 to 10 (1 = poor, 10 = excellent).

RESULTS

Patients' ratings showed no significant difference between artificial intelligence (AI)-generated and physician-generated responses in comprehensiveness (mean 7.12 ± SD 0.99 vs 7.52 ± 1.16; P = 0.1962) or readability (7.90 ± 0.90 vs 7.80 ± 0.75; P = 0.5905). Rheumatologists rated AI responses significantly lower than physician responses on comprehensiveness (AI 5.52 ± 2.13 vs physician 8.76 ± 1.07; P < 0.0001), readability (AI 7.85 ± 0.92 vs physician 8.75 ± 0.57; P = 0.0003), and accuracy (AI 6.48 ± 2.07 vs physician 9.08 ± 0.64; P < 0.0001). The proportion of patients preferring AI-generated over physician-generated responses was 0.45 ± 0.18, compared with 0.15 ± 0.08 for physicians (P = 0.0106). After learning that one answer to each question was AI generated, patients correctly identified the AI-generated answers at a lower proportion than physicians did (0.49 ± 0.26 vs 0.97 ± 0.04; P = 0.0183). AI answers averaged 69.10 ± 25.35 words, compared with 98.83 ± 34.58 words for physician-generated responses (P = 0.0008).

CONCLUSION

Rheumatology patients rated AI-generated responses to patient questions similarly to physician-generated responses in terms of comprehensiveness, readability, and overall preference. However, rheumatologists rated AI responses significantly lower than physician-generated responses, suggesting that LLM chatbot responses are inferior to physician responses in ways that patients may not recognize.
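
The RESULTS above report group means, standard deviations, and P values, but the abstract does not name the statistical test or the exact number of ratings per arm. As a purely illustrative sketch, the Python snippet below recomputes one comparison (patient-rated comprehensiveness) from the reported summary statistics, assuming an independent two-sample Welch t-test and n = 17 raters per arm; both assumptions are mine, so the result is not expected to reproduce the published P value exactly.

```python
# Illustrative re-computation of one comparison from the abstract's
# summary statistics. Assumptions (NOT stated in the abstract): an
# independent two-sample Welch t-test, and n = 17 ratings per arm
# (the number of enrolled patients). The published analysis may have
# used a different test (e.g., a paired design), so the P value here
# need not match the reported P = 0.1962.
from scipy.stats import ttest_ind_from_stats

# Patient-rated comprehensiveness (mean, SD) from the RESULTS section.
ai_mean, ai_sd = 7.12, 0.99   # AI-generated responses
md_mean, md_sd = 7.52, 1.16   # physician-generated responses
n = 17                        # assumed number of raters per arm

t_stat, p_value = ttest_ind_from_stats(
    mean1=ai_mean, std1=ai_sd, nobs1=n,
    mean2=md_mean, std2=md_sd, nobs2=n,
    equal_var=False,          # Welch's correction for unequal variances
)
print(f"t = {t_stat:.3f}, P = {p_value:.4f}")
```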

