Schmidgall Samuel, Harris Carl, Essien Ime, Olshvang Daniel, Rahman Tawsifur, Kim Ji Woong, Ziaei Rojin, Eshraghian Jason, Abadir Peter, Chellappa Rama
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA.
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
NPJ Digit Med. 2024 Oct 21;7(1):295. doi: 10.1038/s41746-024-01283-6.
Interest in applying large language models (LLMs) to medicine has grown in part because of their impressive performance on medical exam questions. However, these exams do not capture the complexity of real patient-doctor interactions, which are shaped by factors such as patient compliance, experience, and cognitive bias. We hypothesized that LLMs would produce less accurate responses to clinically biased questions than to unbiased ones. To test this, we developed the BiasMedQA dataset, which consists of 1273 USMLE questions modified to replicate common clinically relevant cognitive biases. We assessed six LLMs on BiasMedQA and found that GPT-4 stood out for its resilience to bias, whereas Llama 2 70B-chat and PMC Llama 13B showed large drops in performance. We also introduced three bias mitigation strategies, which improved accuracy but did not fully restore it. Our findings highlight the need to improve LLMs' robustness to cognitive biases in order to enable more reliable applications of LLMs in healthcare.
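For readers who want to reproduce this kind of evaluation, the minimal Python sketch below (not the authors' code) illustrates the comparison the abstract describes: the same USMLE-style question is posed with and without an injected cognitive-bias cue, and accuracy is compared across the two conditions. The sample question, the recency-bias sentence, and the query_model stub are hypothetical placeholders; an actual evaluation would substitute the BiasMedQA items and a real LLM client.

    # Illustrative sketch of a biased-vs-unbiased accuracy comparison.
    # All data and the model stub below are hypothetical placeholders.
    from typing import Callable, Optional

    # Toy stand-in for a BiasMedQA item: stem, answer options, and correct key.
    QUESTIONS = [
        {
            "stem": "A 54-year-old man presents with crushing chest pain radiating to the left arm. Most likely diagnosis?",
            "options": {"A": "Myocardial infarction", "B": "Costochondritis", "C": "GERD", "D": "Panic attack"},
            "answer": "A",
        },
    ]

    # Hypothetical cue modeled on recency bias (a recent similar case suggests a different answer).
    RECENCY_BIAS = "Recently, a patient with similar symptoms turned out to have GERD."

    def format_prompt(item: dict, bias: Optional[str] = None) -> str:
        """Render a question as a prompt, optionally injecting a bias sentence before the options."""
        opts = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        bias_line = f"{bias}\n" if bias else ""
        return f"{item['stem']}\n{bias_line}{opts}\nAnswer with a single letter."

    def accuracy(items: list, query_model: Callable[[str], str], bias: Optional[str] = None) -> float:
        """Fraction of items the model answers correctly under the given prompt condition."""
        correct = 0
        for item in items:
            reply = query_model(format_prompt(item, bias))
            correct += reply.strip().upper().startswith(item["answer"])
        return correct / len(items)

    if __name__ == "__main__":
        # Placeholder model that always answers "A"; replace with a real LLM call to run the comparison.
        dummy_model = lambda prompt: "A"
        print("unbiased accuracy:", accuracy(QUESTIONS, dummy_model))
        print("biased accuracy:  ", accuracy(QUESTIONS, dummy_model, bias=RECENCY_BIAS))

The paper's mitigation strategies (for example, warning the model about potential bias in the prompt) could be slotted into format_prompt in the same way the bias sentence is injected here.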