Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA.
Comput Methods Programs Biomed. 2024 Oct;255:108356. doi: 10.1016/j.cmpb.2024.108356. Epub 2024 Jul 24.
Large language models (LLMs) are a form of generative artificial intelligence that has ignited considerable interest and discussion about its utility in clinical and research settings. Despite this interest, there is sparse analysis of their use in qualitative thematic analysis comparing their current ability to that of human coding and analysis. In addition, no analysis of their use on real-world, protected health information has been published.
Here we fill that gap in the literature by comparing an LLM to standard human thematic analysis on real-world, semi-structured interviews with both patients and clinicians in a psychiatric setting.
Using a 70-billion-parameter open-source LLM running on local hardware, together with advanced prompt-engineering techniques, we produced themes summarizing a full corpus of interviews in minutes. Subsequently, we used three different evaluation methods to quantify the similarity between themes produced by the LLM and those produced by humans.
These methods revealed moderate to substantial similarity (Jaccard similarity coefficients of 0.44-0.69), a promising preliminary result.
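For readers unfamiliar with the metric, the Jaccard similarity coefficient reported above is the size of the intersection of two sets divided by the size of their union. A minimal sketch in Python, where the theme labels are purely illustrative (not taken from the study's actual codebook):

```python
def jaccard_similarity(themes_a: set, themes_b: set) -> float:
    """|A intersect B| / |A union B| for two sets of theme labels."""
    if not themes_a and not themes_b:
        return 1.0  # by convention, two empty sets are identical
    return len(themes_a & themes_b) / len(themes_a | themes_b)

# Hypothetical example: four themes each, three shared
llm_themes = {"stigma", "access to care", "medication side effects", "trust"}
human_themes = {"stigma", "access to care", "therapeutic alliance", "trust"}
print(jaccard_similarity(llm_themes, human_themes))  # 3 shared / 5 total = 0.6
```

In practice, applying this coefficient to free-text themes requires first mapping LLM- and human-generated themes onto a shared set of labels, since near-synonymous phrasings would otherwise count as mismatches.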
Our study demonstrates that open-source LLMs can effectively generate robust themes from qualitative data, achieving substantial similarity to human-generated themes. The validation of LLMs in thematic analysis, coupled with evaluation methodologies, highlights their potential to enhance and democratize qualitative research across diverse fields.