School of Information Technology and Engineering, University of Ottawa, 800 King Edward, Ottawa, Ontario, Canada K1N 6N5.
Artif Intell Med. 2011 Jan;51(1):17-25. doi: 10.1016/j.artmed.2010.10.005. Epub 2010 Nov 16.
To determine whether automatic document classification can be useful in systematic reviews on medical topics, and specifically whether classification performance can be enhanced by using the particular protocol of questions employed by the human reviewers to create multiple classifiers.
The test collection is the data used in a large-scale systematic review on the topic of the dissemination strategy of health care services for elderly people. From a group of 47,274 abstracts marked by human reviewers as included in or excluded from further screening, we randomly selected 20,000 as a training set, with the remaining 27,274 becoming a separate test set. As the machine learning algorithm we used complement naïve Bayes. We tested both a global classification method, where a single classifier is trained on instances of abstracts and their classification (i.e., included or excluded), and a novel per-question classification method that trains a separate classifier for each question of the review protocol, thereby exploiting the specific protocol (questions) of the systematic review. For the per-question method we tested four ways of combining the results of the classifiers trained for the individual questions. As evaluation measures, we calculated precision and recall for several settings of the two methods. It is most important not to exclude any relevant documents (i.e., to attain high recall for the class of interest), but it is also desirable to exclude most of the non-relevant documents (i.e., to attain high precision on the class of interest) in order to reduce the human workload.
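As a minimal sketch of the two methods, the Python code below uses scikit-learn's ComplementNB. The tf-idf feature representation, the per-question label encoding, and the OR-style vote combination are illustrative assumptions; the paper tests four combination schemes whose details are not given in this abstract.

```python
# Sketch of global vs. per-question classification with complement naive Bayes.
# Assumptions: tf-idf features; per-question labels are binary vectors
# (1 if the abstract satisfies that protocol question); the OR vote is
# one plausible combination rule, not necessarily the paper's.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

def train_global(train_texts, train_labels):
    """Global method: a single classifier trained on abstracts
    labelled included (1) or excluded (0)."""
    vec = TfidfVectorizer(stop_words="english")
    clf = ComplementNB()
    clf.fit(vec.fit_transform(train_texts), train_labels)
    return vec, clf

def train_per_question(train_texts, question_labels):
    """Per-question method: one classifier per protocol question.
    question_labels maps question id -> binary label vector (an assumed
    encoding of the reviewers' per-question judgements)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(train_texts)
    return vec, {q: ComplementNB().fit(X, y) for q, y in question_labels.items()}

def combine_or(vec, classifiers, texts):
    """One plausible combination rule: include an abstract if ANY
    per-question classifier votes 'include' (favors recall over precision)."""
    X = vec.transform(texts)
    votes = np.stack([clf.predict(X) for clf in classifiers.values()])
    return votes.any(axis=0).astype(int)
```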
For the global method, the highest recall was 67.8% and the highest precision was 37.9%. For the per-question method, the highest recall was 99.2%, and the highest precision was 63%. The human-machine workflow proposed in this paper achieved a recall value of 99.6%, and a precision value of 17.8%.
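For reference, the reported measures are the standard per-class precision and recall on the "included" class; a minimal check with sklearn.metrics is shown below (the label arrays are placeholders, not the review's data):

```python
from sklearn.metrics import precision_score, recall_score

# Placeholder gold labels and predictions (1 = included, 0 = excluded);
# in the study these would cover the 27,274 test abstracts.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred, pos_label=1)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred, pos_label=1)        # TP / (TP + FN)
```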
The per-question method, which combines classifiers following the specific protocol of the review, leads to better results than the global method in terms of recall. Because neither method on its own classifies abstracts reliably enough, the technology should be applied in a semi-automatic way, with a human expert still involved. When the workflow includes one human expert and the trained automatic classifier, recall improves to an acceptable level, showing that automatic classification techniques can reduce the human workload in the process of building a systematic review.
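The abstract does not specify how the expert's and the classifier's decisions are combined, so the union rule sketched below is only one plausible reading of the human-machine workflow: an abstract is retained whenever either screener includes it, which can only match or raise recall relative to either decision alone, at some cost in precision.

```python
def human_machine_include(human_votes, machine_votes):
    """Assumed union rule for the semi-automatic workflow: retain an
    abstract (1) whenever the human expert OR the classifier includes it."""
    return [int(h or m) for h, m in zip(human_votes, machine_votes)]
```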