Department of Computer Science, University of Texas at Austin, Austin, TX, United States.
JMIR Med Inform. 2016 Sep 22;4(3):e27. doi: 10.2196/medinform.5353.
Health science findings are primarily disseminated through manuscript publications. Information subsidies are used to communicate newsworthy findings to journalists in an effort to earn mass media coverage and further disseminate health science research to mass audiences. Journal editors and news journalists then select which news stories receive coverage and thus public attention.
This study aims to identify attributes of published health science articles that correlate with (1) journal editor issuance of press releases and (2) mainstream media coverage.
We constructed four novel datasets to identify factors that correlate with press release issuance and media coverage. These corpora include thousands of published articles, subsets of which received press release or mainstream media coverage. We used statistical machine learning methods to identify correlations between words in the science abstracts and press release issuance and media coverage. Further, we used a topic modeling-based machine learning approach to uncover latent topics predictive of the perceived newsworthiness of science articles.
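As a rough illustration of what a word-level correlation analysis might look like, the sketch below scores each word by a smoothed log-odds ratio of appearing in covered versus uncovered abstracts. This is not the authors' method, which the abstract describes only as "statistical machine learning". The function names (`tokenize`, `word_log_odds`), the smoothing constant, and the four toy abstracts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def word_log_odds(abstracts, labels, smoothing=1.0):
    """Score each word by the log-odds ratio of its document frequency
    in positive (label 1) versus negative (label 0) abstracts.

    Positive scores mean the word appears relatively more often in
    abstracts that received coverage; negative scores mean the opposite.
    """
    pos_docs, neg_docs = Counter(), Counter()
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    for text, y in zip(abstracts, labels):
        (pos_docs if y == 1 else neg_docs).update(tokenize(text))

    scores = {}
    for word in set(pos_docs) | set(neg_docs):
        # Additive smoothing keeps the odds finite for words seen in one class only.
        p_pos = (pos_docs[word] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_docs[word] + smoothing) / (n_neg + 2 * smoothing)
        scores[word] = (math.log(p_pos / (1 - p_pos))
                        - math.log(p_neg / (1 - p_neg)))
    return scores


# Toy data invented for illustration; 1 = received coverage, 0 = did not.
abstracts = [
    "alcohol consumption and mortality in adults",
    "alcohol use and cancer risk",
    "protein binding kinetics in vitro",
    "kinetics of enzyme binding",
]
labels = [1, 1, 0, 0]

scores = word_log_odds(abstracts, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup, words that occur only in the covered abstracts (such as "alcohol") rank at the top of `ranked`, while a word that is equally common in both groups scores zero. A real analysis would more likely fit a regularized classifier over the full vocabulary and inspect its coefficients.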
Both press release issuance for, and media coverage of, health science articles are predictable from corresponding journal article content. For the former task, we achieved average areas under the curve (AUCs) of 0.666 (SD 0.019) and 0.882 (SD 0.018) on two separate datasets, comprising 3024 and 10,760 articles, respectively. For the latter task, models realized mean AUCs of 0.591 (SD 0.044) and 0.783 (SD 0.022) on two datasets, in this case containing 422 and 28,910 pairs, respectively. We report the most predictive words and topics for press release issuance and news coverage.
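For readers unfamiliar with the metric, the following sketch shows one way an AUC can be computed and then summarized as a mean and SD. It assumes the averages are taken over repeated train/test splits, which the abstract does not state. The `auc` helper and the fold data are hypothetical and purely illustrative; they do not reproduce the reported figures.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy held-out folds: (true labels, model scores). Invented for illustration.
folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    ([1, 0, 1, 0], [0.8, 0.3, 0.7, 0.2]),
    ([1, 0, 0, 1], [0.6, 0.6, 0.1, 0.9]),
]

fold_aucs = [auc(y, s) for y, s in folds]
summary = (mean(fold_aucs), stdev(fold_aucs))  # (mean AUC, sample SD)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the values reported above should be read.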
We have presented a novel data-driven characterization of content that renders health science "newsworthy." The analysis provides new insights into the news coverage selection process. For example, it appears epidemiological papers concerning common behaviors (eg, alcohol consumption) tend to receive media attention.