

Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study.

Affiliations

Department of Urology, University of California San Francisco, San Francisco, CA, United States.

Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, United States.

Publication Information

J Med Internet Res. 2024 Aug 21;26:e56500. doi: 10.2196/56500.

Abstract

BACKGROUND

Large language models including GPT-4 (OpenAI) have opened new avenues in health care and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although large language models have demonstrated enhanced contextual understanding and inferencing compared with traditional natural language processing, their performance in qualitative analysis versus that of humans remains unexplored.

OBJECTIVE

We evaluated the effectiveness of GPT-4 versus human researchers in qualitative analysis of interviews with patients with adult-acquired buried penis (AABP).

METHODS

Qualitative data were obtained from semistructured interviews with 20 patients with AABP. Human analysis involved a structured 3-stage process-initial observations, line-by-line coding, and consensus discussions to refine themes. In contrast, artificial intelligence (AI) analysis with GPT-4 underwent two phases: (1) a naïve phase, where GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes and (2) a comparison phase, where AI-generated themes were compared with human-identified themes to assess agreement. We used a general qualitative description approach.

RESULTS

The study population (N=20) comprised predominantly White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men, with a mean age of 58.8 years and BMI of 41.1 kg/m². Human qualitative analysis identified "urinary issues" in 95% (19/20) and GPT-4 in 75% (15/20) of interviews, with the subtheme "spray or stream" noted in 60% (12/20) and 35% (7/20), respectively. "Sexual issues" were prominent (19/20, 95% humans vs 16/20, 80% GPT-4), although humans identified a wider range of subthemes, including "pain with sex or masturbation" (7/20, 35%) and "difficulty with sex or masturbation" (4/20, 20%). Both analyses similarly highlighted "mental health issues" (11/20, 55%, both), although humans coded "depression" more frequently (10/20, 50% humans vs 4/20, 20% GPT-4). Humans frequently cited "issues using public restrooms" (12/20, 60%) as impacting social life, whereas GPT-4 emphasized "struggles with romantic relationships" (9/20, 45%). "Hygiene issues" were consistently recognized (14/20, 70% humans vs 13/20, 65% GPT-4). Humans uniquely identified "contributing factors" as a theme in all interviews. There was moderate agreement between human and GPT-4 coding (κ=0.401). Reliability assessments of GPT-4's analyses showed consistent coding for themes including "body image struggles," "chronic pain" (10/10, 100%), and "depression" (9/10, 90%). Other themes like "motivation for surgery" and "weight challenges" were reliably coded (8/10, 80%), while less frequent themes were variably identified across multiple iterations.
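The reported agreement statistic (κ=0.401) is Cohen's kappa, which corrects observed agreement between two raters for the agreement expected by chance. A minimal sketch of the calculation for binary theme codings follows; the example vectors are illustrative, not the study's actual data.

```python
# Sketch of Cohen's kappa for two raters' binary theme codings.
# The human/gpt4 vectors below are illustrative only, not study data.

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of 0/1 codes."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of interviews coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal coding rates.
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)

# 1 = theme coded present in that interview, 0 = absent (toy data).
human = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
gpt4 = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
print(round(cohens_kappa(human, gpt4), 3))
```

Values of κ between 0.41 and 0.60 are conventionally read as "moderate" agreement, which is how the paper characterizes κ=0.401.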

CONCLUSIONS

Large language models including GPT-4 can effectively identify key themes in analyzing qualitative health care data, showing moderate agreement with human analysis. While human analysis provided a richer diversity of subthemes, the consistency of AI suggests its use as a complementary tool in qualitative research. With AI rapidly advancing, future studies should iterate analyses and circumvent token limitations by segmenting data, furthering the breadth and depth of large language model-driven qualitative analyses.
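The segmentation strategy the conclusion recommends, splitting long transcripts so each piece fits a model's context window, can be sketched as follows. The chars-per-token ratio is a rough heuristic, not the model's actual tokenizer, and the function name is hypothetical.

```python
# Hedged sketch: segment a long interview transcript into chunks that
# fit an approximate token budget, so each chunk can be analyzed in a
# separate model call. Assumes ~4 characters per token as a heuristic.

def segment_transcript(text, max_tokens=3000, chars_per_token=4):
    """Split text into paragraph-aligned chunks under an approximate token budget."""
    budget = max_tokens * chars_per_token
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > budget:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

# Illustrative usage with a toy "transcript".
transcript = "\n\n".join(f"Paragraph {i}: " + "word " * 200 for i in range(10))
print(len(segment_transcript(transcript, max_tokens=300)))
```

Splitting on paragraph boundaries keeps each chunk coherent; per-chunk themes would then need a final merge step, analogous to the consensus discussions in the human workflow.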


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e7e/11375389/6ec268b330cc/jmir_v26i1e56500_fig1.jpg
