School of Psychological Science, University of Bristol, Bristol, United Kingdom.
MRC Integrative Epidemiology Unit, University of Bristol, Bristol, United Kingdom.
J Med Internet Res. 2023 May 8;25:e42734. doi: 10.2196/42734.
The use of social media data to predict mental health outcomes has the potential to allow for the continuous monitoring of mental health and well-being and provide timely information that can supplement traditional clinical assessments. However, it is crucial that the methodologies used to create models for this purpose are of high quality from both a mental health and machine learning perspective. Twitter has been a popular choice of social media because of the accessibility of its data, but access to big data sets is not a guarantee of robust results.
This study aims to review the current methodologies used in the literature for predicting mental health outcomes from Twitter data, with a focus on the quality of the underlying mental health data and the machine learning methods used.
A systematic search was performed across 6 databases using keywords related to mental health disorders, algorithms, and social media. In total, 2759 records were screened, of which 164 (5.94%) papers were included in the analysis. Information was collected about the methodologies for data acquisition, preprocessing, model creation, and validation, as well as about replicability and ethical considerations.
The 164 studies reviewed used 119 primary data sets. A further 8 data sets were identified but were not described in enough detail to be included, and 6.1% (10/164) of the papers did not describe their data sets at all. Of these 119 data sets, only 16 (13.4%) had access to ground truth data (ie, known characteristics) about the mental health disorders of social media users. The other 86.6% (103/119) of data sets were collected by searching for keywords or phrases, which may not be representative of patterns of Twitter use among people with mental health disorders. The annotation of mental health disorders for classification labels was variable, and 57.1% (68/119) of the data sets had no ground truth or clinical input on this annotation. Despite being a common mental health disorder, anxiety received little attention.
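The proportions reported above can be recomputed from the stated counts. The following is a minimal sketch for readers who want to check the arithmetic; the helper name `pct` and the dictionary labels are illustrative assumptions and are not taken from the study.

```python
# Recompute the percentages reported in the abstract from the raw counts.
# The function name `pct` and the labels below are illustrative only.

def pct(numerator: int, denominator: int, digits: int = 1) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)

# (numerator, denominator, decimal places used in the abstract)
reported = {
    "papers analyzed out of records screened": (164, 2759, 2),
    "papers that did not describe their data sets": (10, 164, 1),
    "data sets with ground truth data": (16, 119, 1),
    "data sets collected by keyword or phrase search": (103, 119, 1),
    "data sets with no ground truth or clinical input on annotation": (68, 119, 1),
}

for label, (n, d, digits) in reported.items():
    print(f"{label}: {n}/{d} = {pct(n, d, digits)}%")
```

The ground truth and keyword-search groups together account for all 119 primary data sets (16 + 103 = 119), which is consistent with the two percentages summing to 100%.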
The sharing of high-quality ground truth data sets is crucial for the development of trustworthy algorithms that have clinical and research utility. Further collaboration across disciplines and contexts is encouraged to better understand what types of predictions will be useful in supporting the management and identification of mental health disorders. A series of recommendations for researchers in this field and for the wider research community are made, with the aim of enhancing the quality and utility of future outputs.