Marshall Sarah A, Yang Christopher C, Ping Qing, Zhao Mengnan, Avis Nancy E, Ip Edward H
Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC, 27157, USA.
College of Computing and Informatics, Drexel University, Philadelphia, PA, 19104, USA.
Qual Life Res. 2016 Mar;25(3):547-57. doi: 10.1007/s11136-015-1156-7. Epub 2015 Oct 17.
User-generated content on social media sites, such as health-related online forums, offers researchers a tantalizing amount of information, but concerns regarding scientific application of such data remain. This paper compares and contrasts symptom cluster patterns derived from messages on a breast cancer forum with those from a symptom checklist completed by breast cancer survivors participating in a research study.
Over 50,000 messages generated by 12,991 users of the breast cancer forum on MedHelp.org were transformed into a standard form and examined for the co-occurrence of 25 symptoms. The k-medoids clustering method was used to assign symptoms to clusters. Findings were compared with a similar analysis of a symptom checklist administered to 653 breast cancer survivors participating in a research study.
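The co-occurrence and clustering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distance measure or the k-medoids variant, so the Jaccard distance, the simple alternating (Voronoi-iteration) update, the deterministic initialization, and the function names and toy records below are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance_matrix(records, symptoms):
    """Pairwise symptom distances from co-occurrence (assumed measure).

    records: list of sets, each holding the symptoms found in one message
    or one completed checklist.
    """
    n = len(symptoms)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = symptoms[i], symptoms[j]
        both = sum(1 for r in records if a in r and b in r)
        either = sum(1 for r in records if a in r or b in r)
        similarity = both / either if either else 0.0
        dist[i][j] = dist[j][i] = 1.0 - similarity
    return dist


def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.

    Assumption: medoids start as the first k items, which keeps the toy
    example deterministic; real analyses usually try several starts.
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data (hypothetical, not taken from the study).
symptoms = ["hot flashes", "night sweats", "pain", "fatigue"]
records = [
    {"hot flashes", "night sweats"},
    {"hot flashes", "night sweats"},
    {"pain", "fatigue"},
    {"pain", "fatigue"},
    {"hot flashes", "pain"},
]
dist = jaccard_distance_matrix(records, symptoms)
medoids, labels = k_medoids(dist, k=2)
for name, label in zip(symptoms, labels):
    print(name, "-> cluster", label)
```

On this toy input the two menopausal symptoms fall into one cluster and pain and fatigue into the other. Because medoids are always actual symptoms, each cluster can be summarized by its most central symptom, which is one reason a medoid-based method suits a small, fixed symptom list.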
The following clusters were identified using forum data: menopausal/psychological, pain/fatigue, gastrointestinal, and miscellaneous. The study data yielded the following clusters: menopausal, pain, fatigue/sleep/gastrointestinal, psychological, and increased weight/appetite. Although the two sets of clusters differ somewhat, many symptoms that clustered together in the social media analysis also clustered together in the analysis of the study participants. The density of connections between symptoms, as reflected by rates of co-occurrence and similarity, was higher in the study data.
The copious amount of data generated by social media outlets can augment findings from traditional data sources. When different sources of information are combined, areas of overlap and discrepancy can be detected, perhaps giving researchers a more accurate picture of reality. However, data derived from social media must be used carefully and with understanding of its limitations.