Department of Computer Science, Tufts University, Medford, MA, USA.
BMC Bioinformatics. 2010 Jan 26;11:55. doi: 10.1186/1471-2105-11-55.
BACKGROUND: Systematic reviews address a specific clinical question by assessing and analyzing the pertinent literature without bias. Citation screening is a time-consuming and critical step in systematic reviews. Typically, reviewers must evaluate thousands of citations to identify articles eligible for a given review. We explore the application of machine learning techniques to semi-automate citation screening, thereby reducing the reviewers' workload.
RESULTS: We present a novel online classification strategy for citation screening that automatically discriminates "relevant" from "irrelevant" citations. We use an ensemble of Support Vector Machines (SVMs) built over different feature spaces (e.g., abstract and title text) and trained interactively by the reviewer(s). Semi-automating the citation screening process is difficult because any such strategy must identify all citations eligible for the systematic review. This requirement is made harder still by class imbalance: there are far fewer "relevant" than "irrelevant" citations for any given systematic review. To address these challenges, we employ a custom active-learning strategy developed specifically for imbalanced datasets. Further, we introduce a novel undersampling technique. We provide experimental results over three real-world systematic review datasets and demonstrate that our algorithm reduces the number of citations that must be screened manually by nearly half in two of these, and by around 40% in the third, without excluding any of the citations eligible for the systematic review.
CONCLUSIONS: We have developed a semi-automated citation screening algorithm for systematic reviews that has the potential to substantially reduce the number of citations reviewers have to screen manually, without compromising the quality and comprehensiveness of the review.
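The pipeline outlined in the abstract (one classifier per feature space, undersampling of the majority class, interactive selection of the next citation to label) can be illustrated with a small dependency-free sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: a plain perceptron stands in for each SVM, uniform random undersampling stands in for the paper's novel undersampling technique, generic uncertainty sampling stands in for the custom active-learning strategy, and the "relevant if any member says so" voting rule is assumed rather than taken from the abstract. All function names and the toy citations are hypothetical.

```python
import random
from collections import Counter

FIELDS = ("title", "abstract")  # one feature space per field (assumed)

def featurize(text):
    """Bag-of-words counts for one text field."""
    return Counter(text.lower().split())

def score(model, x):
    """Signed score of a linear model (weights, bias) on sparse features."""
    w, b = model
    return sum(w.get(tok, 0.0) * n for tok, n in x.items()) + b

def train_perceptron(examples, max_epochs=100):
    """Stand-in for an SVM: a plain perceptron on (features, label) pairs
    with label in {+1, -1}. Stops once an epoch makes no mistakes."""
    w, b = {}, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if y * score((w, b), x) <= 0:
                for tok, n in x.items():
                    w[tok] = w.get(tok, 0.0) + y * n
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def undersample(labeled, rng):
    """Randomly drop majority-class examples until the classes are equal in size."""
    pos = [c for c in labeled if c[1] == 1]
    neg = [c for c in labeled if c[1] == -1]
    small, large = sorted((pos, neg), key=len)
    return small + rng.sample(large, len(small))

def train_ensemble(labeled, rng):
    """Train one linear model per feature space on an undersampled set."""
    balanced = undersample(labeled, rng)
    return {f: train_perceptron([(featurize(doc[f]), y) for doc, y in balanced])
            for f in FIELDS}

def predict(ensemble, doc):
    """Recall-oriented vote (assumed rule): relevant if ANY member says so."""
    hit = any(score(m, featurize(doc[f])) > 0 for f, m in ensemble.items())
    return 1 if hit else -1

def next_query(ensemble, pool):
    """Generic uncertainty sampling: the citation closest to any member's boundary."""
    return min(pool, key=lambda d: min(abs(score(m, featurize(d[f])))
                                       for f, m in ensemble.items()))

# Invented toy citations: +1 = relevant, -1 = irrelevant.
labeled = [
    ({"title": "randomized trial of statin therapy",
      "abstract": "patients were randomized to statin or placebo"}, 1),
    ({"title": "mouse model of liver injury",
      "abstract": "mice were fed a high fat diet"}, -1),
    ({"title": "zebrafish heart development",
      "abstract": "embryos were imaged by microscopy"}, -1),
    ({"title": "yeast protein folding",
      "abstract": "cells were grown in culture"}, -1),
]
pool = [
    {"title": "randomized trial of aspirin",
     "abstract": "patients received aspirin or placebo"},
    {"title": "mouse liver imaging",
     "abstract": "mice were fed a control diet"},
]

ensemble = train_ensemble(labeled, random.Random(0))
for doc in pool:
    print(doc["title"], "->", predict(ensemble, doc))
print("label next:", next_query(ensemble, pool)["title"])
```

In an interactive setting, the reviewer would label the citation returned by `next_query`, append it to `labeled`, and retrain. The any-member vote errs toward keeping citations, which matches the stated requirement that no eligible citation be excluded, at the cost of more manual screening.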