Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, Canada.
EPPI-Centre, University College London Institute of Education, London, UK.
Res Synth Methods. 2018 Dec;9(4):587-601. doi: 10.1002/jrsm.1317. Epub 2018 Aug 29.
To identify the best-performing automated text classification method (eg, algorithm) for differentiating empirical studies from nonempirical works, in order to facilitate systematic mixed studies reviews.
The algorithms were trained and validated with 8050 database records that had previously been manually categorized as empirical or nonempirical. A Boolean mixed filter developed for filtering MEDLINE records (title, abstract, keywords, and full texts) was used as a baseline. The set of features (eg, characteristics of the data) included observable terms and concepts extracted from a metathesaurus. The performance of the approaches was measured using sensitivity, precision, specificity, and accuracy.
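The four measures named above can be illustrated with a minimal sketch. It assumes that "empirical" is treated as the positive class and uses a small invented list of labels; the names `ConfusionCounts`, `count_outcomes`, and `evaluate` are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # empirical records labeled empirical
    fp: int  # nonempirical records labeled empirical
    tn: int  # nonempirical records labeled nonempirical
    fn: int  # empirical records labeled nonempirical


def count_outcomes(gold, predicted, positive="empirical"):
    """Tally the confusion matrix; treating 'empirical' as positive is an assumption."""
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive:
            if g == positive:
                tp += 1
            else:
                fp += 1
        else:
            if g == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def evaluate(c):
    """Return sensitivity, precision, specificity, and accuracy as fractions."""
    def ratio(num, den):
        # An undefined ratio (zero denominator) is reported as NaN.
        return num / den if den else float("nan")

    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),  # recall on empirical records
        "precision": ratio(c.tp, c.tp + c.fp),
        "specificity": ratio(c.tn, c.tn + c.fp),
        "accuracy": ratio(c.tp + c.tn, total),
    }


# Invented toy labels for illustration only (not data from the study).
gold = ["empirical"] * 4 + ["nonempirical"] * 6
predicted = ["empirical", "empirical", "empirical", "nonempirical",
             "empirical", "empirical", "nonempirical", "nonempirical",
             "nonempirical", "nonempirical"]

counts = count_outcomes(gold, predicted)
scores = evaluate(counts)
```

The same functions can score a Boolean filter and a trained classifier on one validation set, so that the two approaches are compared on identical records.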
The decision tree algorithm demonstrated the highest performance, surpassing the accuracy of the Boolean mixed filter by 30%. The use of full texts did not result in significant gains compared with title, abstract, keywords, and records. Results also showed that mixing concepts with observable terms can improve the classification.
Automated text classification can accelerate the screening of records identified in bibliographic databases for relevant studies to include in systematic reviews.