Amirova Aliya, Fteropoulli Theodora, Ahmed Nafiso, Cowie Martin R, Leibo Joel Z
Population Health Sciences, School of Life Course & Population Sciences, Faculty of Life Sciences & Medicine, King's College London, London, United Kingdom.
Medical School, University of Cyprus, Nicosia, Cyprus.
PLoS One. 2024 Mar 12;19(3):e0300024. doi: 10.1371/journal.pone.0300024. eCollection 2024.
With the advent of large-scale generative Language Models (LLMs), it is now possible to simulate free responses to interview questions, such as those traditionally analyzed using qualitative research methods. Qualitative methodology encompasses a broad family of techniques involving manual analysis of open-ended interviews or conversations conducted freely in natural language. Here we consider whether artificial "silicon participants" generated by LLMs may be productively studied using qualitative analysis methods in such a way as to generate insights that could generalize to real human populations. The key concept in our analysis is algorithmic fidelity, a validity concept capturing the degree to which LLM-generated outputs mirror human sub-populations' beliefs and attitudes. By definition, high algorithmic fidelity suggests that latent beliefs elicited from LLMs may generalize to real humans, whereas low algorithmic fidelity renders such research invalid. Here we used an LLM to generate interviews with "silicon participants" matching specific demographic characteristics one-for-one with a set of human participants. Using framework-based qualitative analysis, we showed that the key themes obtained from both human and silicon participants were strikingly similar. However, when we analyzed the structure and tone of the interviews, we found even more striking differences. We also found evidence of a hyper-accuracy distortion. We conclude that the LLM we tested (GPT-3.5) does not have sufficient algorithmic fidelity to expect that in silico research on it would generalize to real human populations. However, rapid advances in artificial intelligence raise the possibility that algorithmic fidelity may improve in the future. Thus, we stress the need to establish epistemic norms now around how to assess the validity of LLM-based qualitative research, especially concerning the need to ensure the representation of heterogeneous lived experiences.