Holmes Jason, Liu Zhengliang, Zhang Lian, Ding Yuzhen, Sio Terence T, McGee Lisa A, Ashman Jonathan B, Li Xiang, Liu Tianming, Shen Jiajian, Liu Wei
Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, United States.
School of Computing, The University of Georgia, Athens, GA, United States.
Front Oncol. 2023 Jul 17;13:1219326. doi: 10.3389/fonc.2023.1219326. eCollection 2023.
We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, the LSAT, and the GRE have large test-taker populations and ample test-preparation resources in circulation, they may not allow the true potential of LLMs to be assessed accurately. This paper proposes evaluating LLMs on a highly specialized topic, radiation oncology physics, which may be more pertinent to the scientific and medical communities while also serving as a valuable benchmark for LLMs.
We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by prompting it to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."). A majority-vote analysis was used to approximate how well each group could score when working together.
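As a rough illustration of the majority-vote analysis described above (a minimal sketch, not the authors' actual scoring code), the following Python snippet takes each question's most frequent answer across repeated trials or group members and scores that consensus answer against the answer key; the exam data shown are hypothetical.

```python
from collections import Counter

def majority_vote_score(trial_answers, answer_key):
    """Score a group by majority vote across repeated trials.

    trial_answers: list of answer lists, one per trial (or per group member),
                   each the same length as answer_key.
    answer_key:    list of correct answer choices, one per question.
    Returns the fraction of questions answered correctly by the consensus.
    """
    n_questions = len(answer_key)
    correct = 0
    for q in range(n_questions):
        votes = Counter(trial[q] for trial in trial_answers)
        consensus, _ = votes.most_common(1)[0]  # most frequent choice; ties broken arbitrarily
        if consensus == answer_key[q]:
            correct += 1
    return correct / n_questions

# Hypothetical example: three trials over a five-question exam
key = ["A", "C", "B", "D", "A"]
trials = [
    ["A", "C", "B", "B", "A"],
    ["A", "C", "D", "D", "C"],
    ["A", "B", "B", "D", "A"],
]
print(f"Majority-vote accuracy: {majority_vote_score(trials, key):.2f}")  # 1.00
```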
ChatGPT (GPT-4) outperformed all other LLMs and the medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In the evaluation of deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow its score to improve further under majority voting across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.
This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.