Revercomb Lucy, Patel Aman M, Fu Daniel, Filimonov Andrey
Department of Otolaryngology - Head and Neck Surgery, Rutgers New Jersey Medical School, 185 S Orange Ave, Newark, NJ 07103, USA.
Indian J Otolaryngol Head Neck Surg. 2024 Dec;76(6):6112-6114. doi: 10.1007/s12070-024-04935-x. Epub 2024 Aug 3.
GPT-4, recently released by OpenAI, improves upon GPT-3.5 with increased reliability and expanded capabilities, including user-specified, customizable GPT-4 models. This study aims to compare the performance of GPT-4 and GPT-3.5 on Otolaryngology board-style questions.
A total of 150 Otolaryngology board-style questions were obtained from the BoardVitals question bank. These questions, previously assessed with GPT-3.5, were input into standard GPT-4 and into a custom GPT-4 model instructed to specialize in Otolaryngology board-style questions, emphasize precision, and provide evidence-based explanations.
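For readers who want to reproduce a comparable workflow programmatically, the following is a minimal sketch of submitting one board-style question to GPT-4 with and without custom instructions. The study used the ChatGPT interface and a custom GPT model; the OpenAI API call, the model identifier, and the system prompt wording below are illustrative assumptions rather than the authors' exact setup.

```python
# Hedged sketch: submit a multiple-choice question to GPT-4 via the OpenAI API.
# The model name, system prompt, and scoring workflow are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CUSTOM_INSTRUCTIONS = (
    "You are an expert in Otolaryngology board-style questions. "
    "Select exactly one answer choice, emphasize precision, and "
    "provide an evidence-based explanation."
)

def ask_question(question_text: str, use_custom_instructions: bool) -> str:
    """Submit one question and return the model's reply for manual scoring."""
    messages = []
    if use_custom_instructions:
        messages.append({"role": "system", "content": CUSTOM_INSTRUCTIONS})
    messages.append({"role": "user", "content": question_text})
    response = client.chat.completions.create(
        model="gpt-4",   # assumed model identifier
        temperature=0,   # deterministic answers to make scoring repeatable
        messages=messages,
    )
    return response.choices[0].message.content
```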
Standard GPT-4 correctly answered 72.0% and custom GPT-4 correctly answered 81.3% of the questions, vs. GPT-3.5, which answered 51.3% of the same questions correctly. On multivariable analysis, custom GPT-4 had higher odds of correctly answering questions than standard GPT-4 (adjusted odds ratio 2.19, p = 0.015). Both standard GPT-4 and custom GPT-4 demonstrated decreased performance on questions rated hard compared with those rated easy (p < 0.001).
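Adjusted odds ratios of this kind would typically come from a multivariable logistic regression on per-question correctness. Below is a hedged sketch using statsmodels; the variable names, covariate set, reference levels, and placeholder data are assumptions, not the authors' exact analysis.

```python
# Hedged sketch: multivariable logistic regression giving adjusted odds ratios
# for correctness by model (standard vs. custom GPT-4) and question difficulty.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per question per model: correct (0/1), model, difficulty.
# Placeholder data only; real rows would come from the scored question set.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "correct":    rng.binomial(1, 0.75, 300),
    "model":      ["standard", "custom"] * 150,
    "difficulty": rng.choice(["easy", "moderate", "hard"], 300),
})

fit = smf.logit(
    "correct ~ C(model, Treatment('standard')) + C(difficulty, Treatment('easy'))",
    data=df,
).fit()

adjusted_or = np.exp(fit.params)  # exponentiated coefficients = adjusted odds ratios
print(pd.concat([adjusted_or, fit.pvalues], axis=1, keys=["aOR", "p"]))
```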
Our study suggests that GPT-4 has higher accuracy than GPT-3.5 in answering Otolaryngology board-style questions. Our custom GPT-4 model demonstrated higher accuracy than standard GPT-4, potentially as a result of its instructions to specialize in Otolaryngology board-style questions, select exactly one answer, and emphasize precision. This demonstrates that custom models may further enhance the utility of ChatGPT in medical education.