Mount Sinai Health System, New York, USA.
Hospital Management, Sheba Medical Center, Affiliated to Tel-Aviv University, Tel Aviv, Israel.
Sci Rep. 2024 Jul 28;14(1):17341. doi: 10.1038/s41598-024-66933-x.
This study was designed to assess how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those styled like the USMLE Step 1 exam. To achieve this, we analyzed the responses of GPT-3.5 to two distinct sets of questions: a batch of 1000 questions generated by GPT-4, and another set comprising 95 real USMLE Step 1 questions. These questions spanned a range of medical calculations and clinical scenarios across various fields and difficulty levels. Our analysis revealed no significant differences in the accuracy of GPT-3.5's responses when using direct prompts, CoT, or modified CoT methods. For instance, in the USMLE sample, the success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT, with a p-value of 0.734. Similar trends were observed in the responses to the GPT-4-generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between the prompt types. The conclusion drawn from this study is that CoT prompt engineering does not significantly alter GPT-3.5's effectiveness in handling medical calculations or clinical scenario questions styled like those in the USMLE exams. This finding is important because it suggests that the performance of ChatGPT remains consistent whether a CoT technique or a direct prompt is used. This consistency could simplify the integration of AI tools like ChatGPT into medical education, enabling healthcare professionals to use these tools easily, without the need for complex prompt engineering.
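The abstract does not reproduce the exact prompt wording used in the study. The following is a minimal sketch, assuming the OpenAI chat completions API, of how a direct prompt, a CoT prompt, and a modified CoT prompt might be constructed and submitted to GPT-3.5; the question text, prompt phrasing, model identifier, and temperature setting are illustrative assumptions, not the study's protocol.

```python
# Illustrative sketch only: prompt wording, model name, and settings are assumptions,
# not the prompts actually used in the study.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

QUESTION = (
    "A patient's arterial blood gas shows pH 7.30, PaCO2 30 mmHg, HCO3- 14 mEq/L. "
    "What is the primary acid-base disturbance?\n"
    "A) Metabolic acidosis  B) Respiratory acidosis  "
    "C) Metabolic alkalosis  D) Respiratory alkalosis"
)

# Three prompting strategies compared in the study (hypothetical wording).
PROMPTS = {
    "direct": QUESTION + "\n\nAnswer with the single best option.",
    "cot": QUESTION + "\n\nLet's think step by step, then state the single best option.",
    "modified_cot": QUESTION
    + "\n\nList the relevant clinical findings, reason through each answer choice, "
      "and then state the single best option.",
}


def ask(prompt: str) -> str:
    """Send one prompt to GPT-3.5 and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # assumed; deterministic answers ease grading
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    for name, prompt in PROMPTS.items():
        print(f"--- {name} ---")
        print(ask(prompt))
```

In a setup like this, each question would be posed once per prompting strategy and the proportion of correct answers compared across the three conditions, which is the comparison the reported p-values (e.g., 0.734 for the USMLE sample) refer to.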