Dep. Ing. Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Concepción, Chile.
BMC Med Inform Decis Mak. 2012 Feb 15;12:8. doi: 10.1186/1472-6947-12-8.
Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.
We designed and implemented a method that fits an inverse power law model to the points of a learning curve generated from a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance, with a confidence interval, at larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated from clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness-of-fit measures. As a control, we used an un-weighted fitting method.
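The procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an inverse power law of the form y(x) = (1 - a) - b·x^c (x = training set size, y = classifier performance), and the specific weighting scheme (weights proportional to sample size, so later curve points count more) is an assumption; the learning-curve points are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def inv_power_law(x, a, b, c):
    # Performance approaches the asymptote (1 - a) as x grows;
    # b and c (c < 0 for an increasing curve) set the approach rate.
    return (1.0 - a) - b * np.power(x, c)

# Hypothetical observed learning-curve points: (training size, accuracy)
sizes = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
accs = np.array([0.62, 0.70, 0.76, 0.80, 0.83])

# Weighted nonlinear least squares: curve_fit takes per-point sigma,
# and sigma ~ 1/sqrt(weight), so larger weights shrink the residual
# penalty less -- points from bigger training sets dominate the fit.
weights = sizes / sizes.sum()
params, pcov = curve_fit(
    inv_power_law, sizes, accs,
    p0=(0.1, 1.0, -0.5),          # rough initial guess
    sigma=1.0 / np.sqrt(weights),
    maxfev=10000,
)
a, b, c = params

# Extrapolate: predicted performance at a larger annotation budget.
pred_1000 = inv_power_law(1000.0, a, b, c)
# The parameter covariance pcov can be propagated through the model
# to obtain a confidence interval for this prediction.
```

The fitted asymptote `1 - a` gives the estimated maximum achievable performance, and inverting the fitted curve answers the practical question of how many annotated samples are needed to reach a target score.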
A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute and root mean squared errors below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).
This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.