Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Concepción, Chile.
J Am Med Inform Assoc. 2012 Sep-Oct;19(5):809-16. doi: 10.1136/amiajnl-2011-000648. Epub 2012 Jun 15.
This study explores whether active learning algorithms can reduce the need for large training sets in medical text classification tasks.
Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results.
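The three query strategies above can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring rules below follow the usual textbook forms of these strategies (DIST ranks unlabeled examples by closeness to a margin-based classifier's decision boundary, DIV ranks them by distance from the already-labeled pool, CMB mixes the two), and the mixing weight `alpha` is a hypothetical parameter, not taken from the study.

```python
import numpy as np

def dist_scores(decision_values):
    """DIST-style score: a smaller |decision value| means the example
    lies closer to the decision boundary, so it ranks higher.
    (Assumption: a margin-based classifier such as an SVM.)"""
    return -np.abs(decision_values)

def div_scores(unlabeled, labeled):
    """DIV-style score: distance to the nearest labeled example,
    so novel, far-away examples rank higher."""
    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    d = np.linalg.norm(unlabeled[:, None, :] - labeled[None, :, :], axis=2)
    return d.min(axis=1)

def cmb_scores(decision_values, unlabeled, labeled, alpha=0.5):
    """CMB-style score: convex combination of min-max-normalized DIST
    and DIV scores. alpha is a hypothetical mixing weight."""
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return (alpha * norm(dist_scores(decision_values))
            + (1 - alpha) * norm(div_scores(unlabeled, labeled)))

def select_batch(scores, k):
    """Pick the k highest-scoring unlabeled examples to send for labeling."""
    return np.argsort(scores)[::-1][:k]
```

In each active learning round, the selected batch is labeled, added to the training set, the classifier is retrained, and the scores are recomputed; passive learning instead draws the batch uniformly at random.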
Classification accuracy and the area under the receiver operating characteristic (ROC) curve were computed for each algorithm at a range of training-set sizes. The performance of the active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why performance varies across datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated these measures with the performance differences.
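The relative entropy (Kullback-Leibler divergence) used to characterize the datasets can be sketched as follows. This is only an illustrative computation under stated assumptions: the tokenizer is plain whitespace splitting and the epsilon smoothing is a common shortcut; the paper's exact preprocessing and smoothing scheme are not specified here.

```python
import math
from collections import Counter

def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) = sum_i p_i * log(p_i / q_i).
    eps avoids log(0) for zero-probability terms (a simple smoothing
    shortcut, assumed here rather than taken from the paper)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def term_distribution(docs, vocab):
    """Unigram term distribution over a fixed vocabulary.
    Assumption: whitespace tokenization."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]
```

A dataset-level diversity score could then be obtained, for example, by averaging the relative entropy between each document's term distribution and the corpus-wide distribution; higher average divergence indicates a more heterogeneous collection.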
The DIST and CMB algorithms outperformed passive learning. At a statistical significance level of 0.05, DIST outperformed passive learning on all five datasets, and CMB on four. We found strong correlations between dataset diversity and DIV performance, and between dataset uncertainty and DIST performance.
For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.