Cohen Aaron M, Smalheiser Neil R, McDonagh Marian S, Yu Clement, Adams Clive E, Davis John M, Yu Philip S
Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239 USA
Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612 USA.
J Am Med Inform Assoc. 2015 May;22(3):707-17. doi: 10.1093/jamia/ocu025. Epub 2015 Feb 5.
For many literature review tasks, including systematic review (SR) and other aspects of evidence-based medicine, it is important to know whether an article describes a randomized controlled trial (RCT). Current manual annotation is not complete or flexible enough for the SR process. In this work, highly accurate machine learning predictive models were built that include confidence predictions of whether an article is an RCT.
The LibSVM classifier was used with forward selection of potential feature sets on a large human-related subset of MEDLINE to create a classification model requiring only the citation, abstract, and MeSH terms for each article.
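The forward selection step described above can be illustrated with a minimal sketch. It assumes a greedy procedure: starting from no feature sets, repeatedly add the candidate that most improves a validation score, and stop when nothing improves it. The feature set names, the `toy_score` function, and its point values are hypothetical placeholders, not values from the paper; in a real setting the scoring function would train a classifier such as an SVM on the chosen feature sets and return a held-out metric.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection over named feature sets.

    `score` maps a set of feature-set names to a validation score
    (higher is better). Selection stops when no remaining candidate
    strictly improves the current best score.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset())
    while remaining:
        # Score each remaining candidate when added to the current selection.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        top_score, top = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no candidate helps any more
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best


# Hypothetical stand-in for "train a model and evaluate it": each feature
# set contributes a fixed number of points (made-up integers, not results).
TOY_GAINS = {
    "title_words": 20,
    "abstract_words": 15,
    "mesh_terms": 10,
    "journal_name": -1,
}


def toy_score(feature_sets: FrozenSet[str]) -> float:
    return 50 + sum(TOY_GAINS[name] for name in feature_sets)


chosen, final_score = forward_select(list(TOY_GAINS), toy_score)
print(chosen, final_score)
```

With the toy scorer, the feature set that lowers the score is never added, which is the behavior forward selection is meant to provide.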
The model achieved an area under the receiver operating characteristic curve of 0.973 and a mean squared error of 0.013 on the held-out year 2011 data. Accurate confidence estimates were confirmed on a manually reviewed set of test articles. A second model not requiring MeSH terms was also created, and performed almost as well.
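For readers less familiar with the two metrics reported above, the following is a minimal sketch of how they can be computed from predicted confidences and 0/1 labels. It assumes the mean squared error is taken between the predicted confidence and the binary label, which is an interpretation and is not spelled out in the abstract. The labels and scores are toy values, not data from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_squared_error(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean squared difference between predicted confidence and 0/1 label."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal length")
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Toy example: 1 = RCT, 0 = not an RCT (made-up values for illustration).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
print(roc_auc(toy_labels, toy_scores))
print(mean_squared_error(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of articles, so it suits small illustrations only; a rank-based computation would be the usual choice at MEDLINE scale.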
Both models accurately rank and predict article RCT confidence. Using the model and the manually reviewed samples, it is estimated that about 8000 (3%) additional RCTs can be identified in MEDLINE, and that 5% of articles tagged as RCTs in MEDLINE may not be identified.
Retagging human-related studies with a continuously valued RCT confidence is potentially more useful for article ranking and review than a simple yes/no prediction. The automated RCT tagging tool should offer significant savings of time and effort during the process of writing SRs, and is a key component of a multistep text mining pipeline that we are building to streamline SR workflow. In addition, the model may be useful for identifying errors in MEDLINE publication types. The RCT confidence predictions described here have been made available to users as a web service with a user query form front end at: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/RCT_Tagger.cgi.
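The contrast between a continuously valued confidence and a yes/no tag can be shown with a short sketch. The article identifiers, confidence values, and the 0.5 threshold below are hypothetical and chosen only for illustration; they are not outputs of the RCT Tagger service.

```python
from typing import Dict, List, Tuple


def rank_by_confidence(preds: List[Tuple[str, float]]) -> List[str]:
    """Return article identifiers ordered from most to least likely RCT."""
    return [art_id for art_id, _ in sorted(preds, key=lambda p: p[1], reverse=True)]


def tag_yes_no(preds: List[Tuple[str, float]], threshold: float = 0.5) -> Dict[str, bool]:
    """Collapse each confidence into a binary tag at the given threshold."""
    return {art_id: conf >= threshold for art_id, conf in preds}


# Hypothetical predictions: (article identifier, RCT confidence).
toy_preds = [("A", 0.51), ("B", 0.99), ("C", 0.49), ("D", 0.02)]

print(rank_by_confidence(toy_preds))
print(tag_yes_no(toy_preds))
```

In this toy case, articles A and B receive the same binary tag although their confidences differ widely, and A and C receive different tags although their confidences are nearly equal. The ranking keeps that information available to a reviewer deciding how far down the list to screen.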