Jowsey Tanisha, Stapleton Peta, Campbell Shawna, Davidson Alexandra, McGillivray Cher, Maugeri Isabella, Lee Megan, Keogh Justin
Faculty of Health Sciences and Medicine, Bond University, Gold Coast, Australia.
Faculty of Society and Design, Bond University, Gold Coast, Australia.
PLoS One. 2025 Sep 5;20(9):e0330217. doi: 10.1371/journal.pone.0330217. eCollection 2025.
To determine the accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis.
With the increasing use of GenAI in data analysis, the reliability and suitability of GenAI for conducting qualitative data analysis need to be tested. We propose a method for researchers to assess the reliability of GenAI outputs using deidentified qualitative datasets.
We searched three databases (United Kingdom Data Service, Figshare, and Google Scholar) and five journals (PLOS ONE, Social Science & Medicine, Qualitative Inquiry, Qualitative Research, Sociology Health Review) to identify studies on health-related topics in which human researchers undertook thematic analysis and published both the analysis in a peer-reviewed journal and the associated dataset. We prompted a closed-system GenAI (Microsoft Copilot) to undertake thematic analysis of these datasets and compared the GenAI outputs with the human outputs. Measures included time (GenAI only), accuracy, overlap with human analysis, and reliability of selected data and quotes.
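The quote-reliability measure described above can be approximated programmatically: check whether each GenAI-returned quote actually appears in the source transcript. A minimal sketch, assuming transcripts and quotes are available as plain strings; the normalisation choices here are illustrative, not the authors' protocol:

```python
import re


def normalise(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace so minor
    formatting differences are not counted as fabrication."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def quote_accuracy(quotes: list[str], transcript: str) -> float:
    """Fraction of quotes found verbatim (after normalisation)
    in the source transcript; quotes not found are treated as
    fabricated."""
    if not quotes:
        return 0.0
    corpus = normalise(transcript)
    found = sum(1 for q in quotes if normalise(q) in corpus)
    return found / len(quotes)


transcript = "I felt supported by my team. The workload was heavy at times."
quotes = [
    "I felt supported by my team",  # present in transcript
    "Nobody ever helped me",        # fabricated: not in transcript
]
print(quote_accuracy(quotes, transcript))  # 0.5
```

Exact substring matching is deliberately strict; a real audit might also flag near-matches (e.g. via `difflib.SequenceMatcher`) to separate paraphrased quotes from wholly invented ones.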
Five studies met our inclusion criteria. The themes identified by human researchers and Copilot showed minimal overlap, with human researchers often using discursive thematic analyses (40%) and Copilot producing only thematic analysis (100%). Copilot's outputs often included fabricated quotes (58%, SD = 45%), and none of the Copilot outputs reported participant spread by theme. Additionally, Copilot drew themes and quotes primarily from the first 2-3 pages of textual data rather than from the entire dataset. Human researchers provided broader representation and more accurate quotes (79% of quotes were correct, SD = 27%).
Based on these results, we cannot recommend the current version of Copilot for undertaking thematic analyses. This study raises concerns about the validity of both human-generated and GenAI-generated qualitative data analysis and reporting.