Nassr Nama, Klevis Iliriani, Meng Yang Xia, Brian P. Chen, Linghong Linda Zhou, Supichaya Pojsupap, Coralea Kappel, Katie O'Hearn, Margaret Sampson, Kusum Menon, James Dayre McNally
Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada.
School of Medicine, Trinity College, Dublin, Ireland.
Transl Pediatr. 2017 Jan;6(1):18-26. doi: 10.21037/tp.2016.12.01.
Completing large systematic reviews and keeping them up to date pose significant challenges, mainly because of the heavy workload placed on the small group of experts who must screen and extract potentially eligible citations. Automated approaches have so far failed to provide an accessible and adaptable tool for the research community. Over the past decade, crowdsourcing has become attractive in scientific research, and applying it to citation screening could save the investigative team significant work and shorten the time to publication.
Citations from the 2015 update of a pediatric vitamin D systematic review were uploaded to an online platform designed for crowdsourcing the screening process (http://www.CHEORI.org/en/CrowdScreenOverview). Three sets of exclusion criteria were applied: abstracts were reviewed at level one, and full-text eligibility was determined through two further screening stages. Two trained reviewers who had participated in the initial systematic review established the reference standard for citation eligibility. In parallel, each citation received four independent assessments from an untrained crowd with a medical background. Citations were retained or excluded if at least three of the four assessments agreed; otherwise, they were reviewed by the principal investigator. Measured outcomes included the sensitivity of the crowd in retaining eligible studies, and potential work saved, defined as the proportion of citations sorted by the crowd (excluded or retained) without involvement of the principal investigator.
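As a minimal sketch of the consensus rule described above (the function and type names are illustrative; the study's platform is not open-sourced in this abstract), the triage of one citation from four crowd votes can be expressed as:

```python
from enum import Enum

class Decision(Enum):
    RETAIN = "retain"
    EXCLUDE = "exclude"
    PI_REVIEW = "pi_review"  # escalate to the principal investigator

def triage(votes: list[bool]) -> Decision:
    """Sort one citation from four independent crowd votes (True = retain)."""
    assert len(votes) == 4, "the study collected four assessments per citation"
    retain_votes = sum(votes)
    if retain_votes >= 3:      # three or four congruent "retain" assessments
        return Decision.RETAIN
    if retain_votes <= 1:      # three or four congruent "exclude" assessments
        return Decision.EXCLUDE
    return Decision.PI_REVIEW  # 2-2 split: no consensus, PI reviews the citation
```

For example, `triage([True, True, True, False])` retains the citation without PI involvement, while `triage([True, True, False, False])` escalates it.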
A total of 148 citations were identified for screening, of which 20 met eligibility criteria (true positives). The four crowd reviewers agreed completely on 63% (95% CI: 57-69%) of assessments, and the crowd achieved a sensitivity of 100% (95% CI: 88-100%) and a specificity of 99% (95% CI: 96-100%). Potential work saved for the research team was 84% (95% CI: 77-89%) at the abstract screening stage, and 73% (95% CI: 67-79%) across all three levels. In addition, different thresholds for citation retention and exclusion were assessed. With an algorithm favoring sensitivity (a citation was excluded only if all four reviewers agreed), sensitivity was maintained at 100%, while potential work saved decreased to 66% (95% CI: 59-71%). In contrast, raising the threshold required for retention (excluding all citations that did not obtain 3/4 retain assessments) decreased sensitivity to 85% (95% CI: 65-96%), while improving potential work saved to 92% (95% CI: 88-95%).
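The threshold variants can be read as parameterizing the consensus rule above. The following sketch shows how sensitivity, specificity, and potential work saved would be scored against the trained reviewers' reference standard; the function name, parameters, and the assumption that PI review reproduces the reference standard are all interpretations, not a published specification:

```python
def evaluate(crowd_votes: list[list[bool]], gold_eligible: list[bool],
             retain_threshold: int = 3, exclude_threshold: int = 3) -> dict:
    """Score crowd screening against the reference standard.

    crowd_votes: four booleans per citation (True = retain).
    gold_eligible: reference-standard eligibility per citation.
    Citations meeting neither threshold go to PI review; we assume the
    PI reproduces the reference standard, so those citations cannot be
    misclassified but also do not count toward work saved.
    """
    tp = fn = tn = fp = auto_sorted = 0
    for votes, eligible in zip(crowd_votes, gold_eligible):
        retains = sum(votes)
        excludes = len(votes) - retains
        if retains >= retain_threshold:
            decision, auto = True, True
        elif excludes >= exclude_threshold:
            decision, auto = False, True
        else:
            decision, auto = eligible, False  # PI review (assumption above)
        auto_sorted += auto
        tp += decision and eligible
        fp += decision and not eligible
        tn += not decision and not eligible
        fn += not decision and eligible
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "work_saved": auto_sorted / len(gold_eligible),
    }
```

Under this reading, the base algorithm corresponds to `(retain_threshold=3, exclude_threshold=3)` and the sensitivity-favoring variant to `(3, 4)`; the stricter retention variant excludes anything short of 3/4 retain votes. This mapping is one interpretation of the abstract, and the reported percentages also reflect screening across all three levels.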
This study demonstrates the accuracy of crowdsourcing for systematic review citation screening, with retention of all eligible articles and a significant reduction in the work required of the investigative team. Together, these findings suggest that crowdsourcing could represent a significant advance for systematic reviews. Future directions include further study to assess validity across medical fields and to determine the capacity of a non-medical crowd.