Sharp Kellen, Ouellette Rachel R, Singh Rujula Singh Rajendra, DeVito Elise E, Kamdar Neil, de la Noval Amanda, Murthy Dhiraj, Kong Grace
Department of Radio-Television-Film, University of Texas at Austin, Austin, Texas, United States.
Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, United States.
PeerJ Comput Sci. 2025 Mar 14;11:e2710. doi: 10.7717/peerj-cs.2710. eCollection 2025.
Social media research is confronted by the expansive and constantly evolving nature of social media data. Hashtags and keywords are frequently used to identify content related to a specific topic, but these search strategies often return large numbers of irrelevant results. Methods are therefore needed to quickly screen social media content against a specific research question. The primary objective of this article is to present generative artificial intelligence (AI; i.e., ChatGPT) and machine learning methods to screen content from social media platforms. As a proof of concept, we apply these methods to identify TikTok content related to e-cigarette use during pregnancy.
We searched TikTok for pregnancy and vaping content using 70 hashtag pairs related to "pregnancy" and "vaping" (e.g., #pregnancytok and #ecigarette), obtaining 11,673 distinct posts. We extracted post videos, descriptions, and metadata using Zeeschuimer and the PykTok library. To enhance textual analysis, we used the Whisper automatic speech recognition system to transcribe verbal content from each video. Next, we used the OpenCV library to extract frames from the videos, followed by object and text detection analysis using Oracle Cloud Vision. Finally, we merged all text data to create a consolidated dataset and entered this dataset into ChatGPT-4 to determine which posts were related to vaping and pregnancy. To refine the ChatGPT prompt used to screen for content, a human coder cross-checked ChatGPT-4's outputs for 10 out of every 100 metadata entries, with errors used to inform the final prompt. The final prompt was evaluated through human review: a reviewer checked posts for "pregnancy" and "vape" content, and these determinations were compared with those made by ChatGPT.
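The merge, prompt, and spot-check steps described in the methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PostText` fields, the prompt wording, and the function names are all hypothetical. The transcript, detected objects, and on-screen text are assumed to have been produced already by the speech recognition and vision steps.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PostText:
    """Text gathered for one post (hypothetical field names)."""
    post_id: str
    description: str = ""
    transcript: str = ""                                  # from speech recognition
    detected_objects: list = field(default_factory=list)  # from object detection
    onscreen_text: str = ""                               # from text detection


def merge_text(post: PostText) -> str:
    """Combine every non-empty text source into one labeled block."""
    parts = [
        ("Description", post.description),
        ("Transcript", post.transcript),
        ("Detected objects", ", ".join(post.detected_objects)),
        ("On-screen text", post.onscreen_text),
    ]
    return "\n".join(
        f"{label}: {value.strip()}" for label, value in parts if value.strip()
    )


def build_prompt(merged: str) -> str:
    """Wrap the merged text in a screening question (wording is an assumption)."""
    return (
        "Does the following TikTok post relate to pregnancy, vaping, both, "
        "or neither? Answer with one word.\n\n" + merged
    )


def sample_for_review(post_ids, per_batch=10, batch_size=100, seed=0):
    """Pick `per_batch` posts out of every `batch_size` for a human cross-check."""
    rng = random.Random(seed)
    sample = []
    for start in range(0, len(post_ids), batch_size):
        batch = post_ids[start:start + batch_size]
        sample.extend(rng.sample(batch, min(per_batch, len(batch))))
    return sample
```

The prompt string would then be sent to the language model through whatever client is in use. Fixing the random seed keeps the human review sample reproducible across prompt revisions.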
Our results indicated that ChatGPT-4 classified 44.86% of the videos as exclusively related to pregnancy, 36.91% as exclusively related to vaping, and 8.91% as containing both topics. A human reviewer confirmed vaping and pregnancy content in 45.38% of the TikTok posts that ChatGPT identified as containing relevant content. Human review of 10% of the posts screened out by ChatGPT found 99.06% agreement on excluded posts.
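The two validation figures are both simple proportions taken over different sets of posts: the share of flagged posts that a human confirmed, and the share of reviewed exclusions that a human agreed with. The sketch below shows the arithmetic only. The counts are hypothetical placeholders, not the study's data, and the function name is an assumption.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage rounded to two decimals."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100 * numerator / denominator, 2)


# Hypothetical counts for illustration only (not taken from the study).
flagged_by_model = 200      # posts the model marked as relevant
confirmed_by_human = 90     # of those, posts a human also judged relevant
excluded_reviewed = 500     # sample of model-excluded posts given to a human
excluded_agreed = 495       # of those, posts the human also excluded

confirmation_rate = percent(confirmed_by_human, flagged_by_model)
exclusion_agreement = percent(excluded_agreed, excluded_reviewed)
print(confirmation_rate, exclusion_agreement)
```

Reporting the two rates separately matters because a screen can agree with humans almost perfectly on exclusions while still flagging many posts that a human would reject.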
ChatGPT has mixed capacity to screen social media content that has been converted into text data using machine learning techniques such as object detection. In the current case example, ChatGPT's sensitivity was lower than that of a human coder, but ChatGPT showed strength in screening out irrelevant content and can be used as an initial screening pass. Future studies should explore ways to enhance ChatGPT's sensitivity.