Mo Yuanhan, Kontonatsios Georgios, Ananiadou Sophia
School of Computer Science, National Centre for Text Mining, The University of Manchester, Manchester, UK.
Syst Rev. 2015 Nov 26;4:172. doi: 10.1186/s13643-015-0117-0.
Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. Recently, a number of studies have shown that using machine learning and text mining methods to automatically identify relevant studies has the potential to drastically decrease the workload involved in the screening phase. The vast majority of these machine learning methods exploit the same underlying principle, i.e. a study is modelled as a bag-of-words (BOW).
We explore the use of topic modelling methods to derive a more informative representation of studies. We apply latent Dirichlet allocation (LDA), an unsupervised topic modelling approach, to automatically identify topics in a collection of studies. We then represent each study as a distribution over LDA topics. Additionally, we enrich the topics derived using LDA with multi-word terms identified by an automatic term recognition (ATR) tool. For evaluation purposes, we carry out automatic identification of relevant studies using support vector machine (SVM)-based classifiers that employ either our novel topic-based representation or the BOW representation.
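The contrast between the two representations can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy collapsed Gibbs sampler for LDA written in plain Python, and the function names (`bow_vector`, `lda_topic_vectors`), the hyperparameters (`alpha`, `beta`, `iters`) and the four tiny example documents are all hypothetical. ATR-based term enrichment and the SVM classifier are left out. The sketch only shows how a study becomes either a vocabulary-length count vector or a short vector of topic proportions.

```python
import random
from collections import Counter


def bow_vector(tokens, vocab):
    """Bag-of-words: one raw count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def lda_topic_vectors(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative assumption only).

    Returns the sorted vocabulary and one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assign = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions: the topic-based feature vector.
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        thetas.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return vocab, thetas


# Hypothetical tokenised abstracts, for illustration only.
docs = [
    "randomised trial of statin therapy in patients".split(),
    "statin therapy reduces cholesterol in patients".split(),
    "school programme improves pupil reading outcomes".split(),
    "reading programme evaluated in primary school pupils".split(),
]
vocab, thetas = lda_topic_vectors(docs, num_topics=2)
bow = [bow_vector(doc, vocab) for doc in docs]

for doc_bow, theta in zip(bow, thetas):
    print(len(doc_bow), "BOW features ->", [round(p, 3) for p in theta])
```

In a screening pipeline along the lines described above, either set of vectors (`bow` or `thetas`) would be passed to a classifier such as an SVM. The topic vectors have one feature per topic rather than one per vocabulary word.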
Our results show that the SVM classifier identifies a greater number of relevant studies when using the LDA representation than when using the BOW representation. These observations hold for two systematic reviews in the clinical domain and three reviews in the social science domain.
A topic-based feature representation of documents outperforms the BOW representation when applied to the task of automatic citation screening. The proposed term-enriched topics are more informative and less ambiguous to systematic reviewers.