Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach.

Authors

Wallace Byron C, Noel-Storr Anna, Marshall Iain J, Cohen Aaron M, Smalheiser Neil R, Thomas James

Affiliations

College of Computer and Information Science, Northeastern University, Boston MA, USA.

Radcliffe Department of Medicine, University of Oxford, Oxford, UK.

Publication

J Am Med Inform Assoc. 2017 Nov 1;24(6):1165-1168. doi: 10.1093/jamia/ocx053.

Abstract

OBJECTIVES

Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML.

METHODS

We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise.
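
The triage rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Citation` type, the `triage` function, the identifiers, and the threshold value of 0.1 are all hypothetical, and the classifier scores are assumed to be given, since the abstract does not specify the model or the cutoff.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    ident: str        # hypothetical citation identifier
    rct_score: float  # assumed classifier estimate that the citation describes an RCT


def triage(citations, threshold=0.1):
    """Split citations into auto-excluded and crowd-deferred lists.

    Citations scoring below `threshold` are treated as very unlikely to be
    RCTs and excluded automatically; all others are deferred to crowdworkers.
    The default threshold is an illustrative assumption.
    """
    auto_excluded, to_crowd = [], []
    for citation in citations:
        if citation.rct_score < threshold:
            auto_excluded.append(citation)
        else:
            to_crowd.append(citation)
    return auto_excluded, to_crowd


# Toy data with made-up scores.
citations = [
    Citation("a", 0.02),
    Citation("b", 0.45),
    Citation("c", 0.97),
    Citation("d", 0.08),
]
auto_excluded, to_crowd = triage(citations, threshold=0.1)
```

With these made-up scores, citations "a" and "d" are excluded automatically and "b" and "c" go to the crowd. Lowering the threshold sends more citations to crowdworkers and reduces the chance of discarding a true RCT.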

RESULTS

Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone.
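
To make the two reported quantities concrete, the sketch below computes recall and effort reduction for a threshold rule on a labeled sample. The `evaluate` function, the toy scores and labels, and the threshold are hypothetical, and the recall figure assumes crowdworkers correctly identify every RCT deferred to them, so it is an upper bound rather than the authors' estimation procedure.

```python
def evaluate(scores, labels, threshold):
    """Return (recall, effort_reduction) for a threshold-based triage rule.

    scores: assumed classifier estimates that each citation is an RCT
    labels: 1 if the citation truly describes an RCT, else 0
    recall: fraction of true RCTs that are deferred rather than auto-excluded
    effort_reduction: fraction of citations that no human needs to screen
    """
    deferred = [s >= threshold for s in scores]
    total_rcts = sum(labels)
    retained_rcts = sum(1 for d, y in zip(deferred, labels) if d and y)
    recall = retained_rcts / total_rcts if total_rcts else float("nan")
    effort_reduction = 1 - sum(deferred) / len(scores)
    return recall, effort_reduction


# Made-up sample: 10 citations, 4 of which are true RCTs.
scores = [0.01, 0.03, 0.05, 0.20, 0.60, 0.90, 0.02, 0.04, 0.95, 0.07]
labels = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
recall, effort_reduction = evaluate(scores, labels, threshold=0.1)
```

In this toy sample one true RCT scores 0.07 and is excluded automatically, so recall is 0.75, while only 4 of 10 citations need human screening, an effort reduction of 0.6. The toy numbers only illustrate the trade-off and say nothing about the 95%-99% recall and 60%-80% reduction reported in the study.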

CONCLUSIONS

Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d9f/5975623/87b1e93f03a5/ocx053f1.jpg
