Department of Ophthalmology and Francis I Proctor Foundation, University of California San Francisco, San Francisco, CA, United States.
Center for Vulnerable Populations, Zuckerberg San Francisco General Hospital, Department of Medicine, University of California San Francisco, San Francisco, CA, United States.
JMIR Infodemiology. 2024 Aug 29;4:e59641. doi: 10.2196/59641.
Manually analyzing public health-related content from social media provides valuable insights into individuals' beliefs, attitudes, and behaviors, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort required from well-trained human subject matter experts make extensive manual social media listening infeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings from large sets of social media posts and reasonably report health-related themes.
We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large collections of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts?
We asked the same research question and used the same set of social media content for both LLM topic selection and LLM thematic analysis as in a published study about vaccine rhetoric, in which those analyses were conducted manually. Using the results of that study as the benchmark for this experiment, we compared the prior manual human analyses with analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed whether the LLMs had equivalent ability and how consistent each LLM's repeated analyses were.
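To make this repeated-run design concrete, the following Python sketch shows how one might submit an identical topic-selection prompt to each model several times to gauge both between-model differences and within-model consistency. This is an illustration only, not the study's implementation: the query_model function, the prompt wording, and the repeat count are hypothetical placeholders.

from typing import Dict, List

MODELS = ["GPT4-32K", "Claude-instant-100K", "Claude-2-100K"]
N_RUNS = 3  # hypothetical number of repeated runs per model

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a call to the given LLM's API."""
    raise NotImplementedError

def rank_topics(posts: List[str]) -> Dict[str, List[str]]:
    # One fixed prompt, so that differences across runs reflect
    # model variability rather than prompt variability.
    prompt = ("Rank the content areas of the following social media "
              "posts by relevance to vaccine rhetoric:\n" + "\n".join(posts))
    results: Dict[str, List[str]] = {}
    for model in MODELS:
        # Repeat the identical prompt to assess within-model consistency.
        results[model] = [query_model(model, prompt) for _ in range(N_RUNS)]
    return results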
The LLMs generally ranked highly the topics that humans had previously chosen as most relevant. We rejected the null hypothesis (P<.001, overall comparison) and concluded that these LLMs were more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, the LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Although the LLMs did not consistently match the human-generated themes, subject matter experts judged the LLM-generated themes to be reasonable and relevant.
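To illustrate the "by chance" baseline behind this comparison: if a model's top-k ranking were drawn uniformly at random from n candidate content areas, the probability that it contains all 5 human-rated top areas has a simple hypergeometric form. The sketch below is a worked illustration of that baseline, not the study's exact test, and the topic counts used in the example are hypothetical, not the study's values.

from math import comb

def p_top5_by_chance(n_topics: int, k: int) -> float:
    """Probability that a uniformly random top-k list drawn from
    n_topics candidate content areas contains all 5 of the
    human-rated top-5 areas: C(n-5, k-5) / C(n, k)."""
    if k < 5:
        return 0.0
    return comb(n_topics - 5, k - 5) / comb(n_topics, k)

# Hypothetical example: 20 candidate content areas, top-10 ranking.
print(p_top5_by_chance(20, 10))  # ~0.016, so chance inclusion is unlikely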
LLMs can effectively and efficiently process large social media-based health-related data sets and can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested could replicate the depth of human subject matter expert analysis by consistently extracting the same themes from the same data. Once better validated, automated LLM-based real-time social listening has vast potential for common and rare health conditions, informing public health understanding of the public's interests and concerns and identifying the public's ideas for addressing them.