Chen Shan, Gallifant Jack, Gao Mingye, Moreira Pedro, Munch Nikolaj, Muthukkumar Ajay, Rajan Arvind, Kolluri Jaya, Fiske Amelia, Hastings Janna, Aerts Hugo, Anthony Brian, Celi Leo Anthony, La Cava William G, Bitterman Danielle S
Harvard.
Mass General Brigham.
Adv Neural Inf Process Syst. 2024;37(D&B):23756-23795.
Large language models (LLMs) are increasingly essential for natural language processing, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalence in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representations of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods do little to resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.
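The core comparison the abstract describes can be illustrated with a minimal sketch: derive a model-side prevalence distribution over demographic subgroups (e.g., from co-occurrence counts of a disease term with subgroup terms in a pre-training corpus), normalize it alongside real epidemiological prevalence rates, and quantify the mismatch. This is not the paper's actual pipeline; the metric choices (total variation distance, rank agreement) and all counts and rates below are illustrative assumptions.

```python
# Hedged sketch (not the paper's method): measure how far a corpus-derived
# disease-prevalence distribution over demographic subgroups diverges from
# real prevalence. All numbers are made up for illustration.

def normalize(counts):
    """Scale non-negative values so they sum to 1 over the subgroups."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two distributions on the same keys."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

def rank_agreement(p, q):
    """True if both distributions order the subgroups identically."""
    order = lambda d: sorted(d, key=d.get, reverse=True)
    return order(p) == order(q)

# Hypothetical co-occurrence counts of one disease term with subgroup
# terms in a pre-training corpus (the model-side signal) ...
model_counts = {"white": 5200, "black": 1100, "hispanic": 900}
# ... versus hypothetical real U.S. prevalence rates per subgroup.
real_rates = {"white": 0.076, "black": 0.121, "hispanic": 0.118}

p = normalize(model_counts)
q = normalize(real_rates)
print(f"TV distance: {total_variation(p, q):.3f}")
print("Same subgroup ranking:", rank_agreement(p, q))
```

A large total variation distance together with a ranking disagreement is the kind of signal the abstract calls misalignment: the corpus over-represents one subgroup relative to the disease's actual epidemiology.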