
Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study.

Affiliations

Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States.

Harvard Medical School, Boston, MA, United States.

Publication Information

J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659.

Abstract

BACKGROUND

Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.

OBJECTIVE

This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.

METHODS

We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT's performance on clinical tasks.
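The accuracy metric described here is a simple proportion of correct responses, reported with 95% confidence intervals. A minimal sketch of how such an interval can be computed, assuming a normal-approximation (Wald) interval and hypothetical counts (the study does not state its exact CI method or raw counts):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Proportion correct with a normal-approximation (Wald) 95% CI."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return p, p - z * se, p + z * se

# Hypothetical counts for illustration only (not the study's raw data).
p, lo, hi = accuracy_ci(717, 1000)
print(f"accuracy {p:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

With many questions pooled across vignettes, this interval narrows; the study's tight overall CI (69.3%-74.1% around 71.7%) is consistent with a pooled-proportion calculation over a large question set.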

RESULTS

ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types.
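The reported β values compare each question type against the general-medical-knowledge reference category. In a dummy-coded linear regression with a single categorical predictor, each β is exactly the difference in mean accuracy between that category and the reference, which a short sketch makes concrete (the outcome lists below are hypothetical, not study data):

```python
# Dummy-coded OLS with one categorical predictor: the slope on each
# category dummy equals that category's mean-accuracy difference from
# the reference category (here, general medical knowledge).
# Hypothetical per-question outcomes (1 = correct), not study data.
general = [1, 1, 1, 0, 1, 1, 0, 1]       # reference category
differential = [1, 0, 1, 0, 0, 1, 0, 1]  # differential-diagnosis items

def mean(xs):
    return sum(xs) / len(xs)

beta_diff = mean(differential) - mean(general)  # slope on the dummy
print(f"beta for differential diagnosis: {beta_diff:+.1%}")
```

A negative β, as with the study's β=-15.8% for differential diagnosis, therefore means that category's accuracy ran that many percentage points below accuracy on general knowledge questions.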

CONCLUSIONS

ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6d8/10481210/3a1829727c2b/jmir_v25i1e48659_fig1.jpg
