Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, TN, USA.
J Biomed Inform. 2012 Apr;45(2):265-72. doi: 10.1016/j.jbi.2011.11.003. Epub 2011 Nov 22.
Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are expensive to build because annotation requires physicians. Active learning, an approach that actively selects samples for annotation from a large pool, provides an alternative. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its use in clinical NLP. This paper reports an application of active learning to a clinical text classification task: determining the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their performance. Results are reported as the global ALC score, the area under the average learning curve of the AUC (area under the curve) score. Results showed that, with the same number of annotated samples, active learning strategies could generate better classification models (best ALC of 0.7715) than the passive learning method of random sampling (ALC of 0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than random sampling. For example, to achieve an AUC of 0.79, random sampling used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort.
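The quantities in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the algorithms used, so least-confidence uncertainty sampling stands in here as one common active learning query strategy, and the learning-curve area is a plain trapezoidal average that may differ from the paper's exact ALC normalization. All function names are hypothetical.

```python
def least_confident_index(positive_probabilities):
    """Pick the pool sample the current model is least sure about.

    Assumption: binary classification, where each value is the model's
    predicted probability of the positive class. The sample closest to
    0.5 is treated as the most uncertain and is queried next.
    """
    return min(
        range(len(positive_probabilities)),
        key=lambda i: abs(positive_probabilities[i] - 0.5),
    )


def area_under_learning_curve(sample_counts, auc_scores):
    """Trapezoidal area under AUC vs. number of annotated samples.

    The area is divided by the x-range so the result stays on the same
    0-1 scale as AUC. Assumption: the paper's global ALC score may use
    a different axis scaling or normalization.
    """
    area = 0.0
    for i in range(1, len(sample_counts)):
        width = sample_counts[i] - sample_counts[i - 1]
        area += width * (auc_scores[i] + auc_scores[i - 1]) / 2
    return area / (sample_counts[-1] - sample_counts[0])


def annotation_reduction(passive_samples, active_samples):
    """Fraction of annotation effort saved relative to random sampling."""
    return (passive_samples - active_samples) / passive_samples


# Figures reported in the abstract: 32 samples (random) vs. 12 (active).
print(annotation_reduction(32, 12))  # 0.625, i.e. the reported 62.5%
```

In a full pool-based loop, a model would be retrained after each queried sample is labeled, and `area_under_learning_curve` would be applied to the AUC values recorded at each step, once for the active strategy and once for random sampling.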