Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.

Affiliations

Department of Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, Pittsburgh, PA, United States.

Rory Meyers College of Nursing, New York University, New York, NY, United States.

Publication Information

JMIR Form Res. 2024 Oct 1;8:e51383. doi: 10.2196/51383.

Abstract

BACKGROUND

Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general-purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis, without additional training. However, its ability to learn and apply a series of specialized, context-specific tasks that mimic the workflow of a human assessor, such as administering a standardized assessment questionnaire, entering the assessment results into a standardized form, and interpreting those results strictly according to credible, published scoring criteria, has not been thoroughly studied.

OBJECTIVE

This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization.

METHODS

We used prompt engineering to train the ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology for improving the accuracy and consistency of a model's outputs by carefully structuring its inputs. In this study, it involved creating specific, structured commands that guided the AI models in understanding the assessment tool's criteria and applying them accurately to clinical vignettes. It also included designing prompts that explicitly instructed the AI on how to format its responses so that they were consistent with clinical documentation standards.
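To make the workflow concrete, the sketch below shows how such a structured assessment prompt could be issued programmatically. This is an illustration only: the study interacted with ChatGPT directly rather than through code, the questionnaire items and weights are placeholders rather than the published Sour Seven content, and the snippet assumes the OpenAI Python client with an API key available in the environment.

```python
# Minimal sketch of the structured assessment prompt described above.
# Assumptions (not from the study): OpenAI Python client, OPENAI_API_KEY set,
# and placeholder questionnaire text instead of the published Sour Seven items.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are assisting with an informant-based delirium screening task.
You will receive (1) the items of a screening questionnaire with their scoring
weights and (2) a clinical vignette describing a patient. For each item, answer
strictly "Yes" or "No" based only on the vignette, assign the item's weight when
the answer is "Yes" and 0 otherwise, and then report the total score.
Respond only as a table with the columns: Item | Answer (Yes/No) | Score."""

def score_vignette(questionnaire: str, vignette: str, model: str = "gpt-4") -> str:
    """Send one structured assessment prompt and return the model's scored report."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # favor consistent, reproducible scoring
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Questionnaire:\n{questionnaire}\n\nVignette:\n{vignette}"},
        ],
    )
    return response.choices[0].message.content
```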

RESULTS

Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive "Yes" or "No" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire.
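As an illustration of how these refinements could be expressed, the revised instruction block below folds the three optimizations into a single prompt. The wording, score cutoffs, and the two recommended actions are placeholders; the published Sour Seven scoring criteria are not reproduced here.

```python
# Illustrative reconstruction of the refined evaluation prompt; cutoffs and
# recommended actions are placeholders, not the published Sour Seven criteria.
REFINED_SYSTEM_PROMPT = """You are assisting with an informant-based delirium screening task.
Rules:
1. Answer each questionnaire item with exactly "Yes" or "No". Do not use
   "Unsure", "Possibly", or any other qualifier; if the vignette provides no
   evidence for an item, answer "No".
2. Report the assessment only as a table with the columns
   Item | Answer (Yes/No) | Score, followed by a Total row.
3. Based on the total score, state exactly one of the two actions recommended
   by the questionnaire's published scoring criteria (for example, routine
   monitoring below the cutoff, or prompt clinical assessment for delirium at
   or above it), and nothing else."""
```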

CONCLUSIONS

Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, additional research is needed to establish broader generalizability and to validate these findings in real-world settings.
