Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, 77030, USA.
Semin Oncol. 2010 Feb;37(1):65-8. doi: 10.1053/j.seminoncol.2009.12.002.
The performance of many repeated tasks improves with experience and practice. This improvement tends to be rapid initially and then to slow; the term "learning curve" is often used to describe the phenomenon. In supervised machine learning, the performance of classification algorithms often increases with the number of observations used to train the algorithm. We use progressively larger samples of observations to train the algorithm and then plot performance against the number of training observations. This yields the familiar negatively accelerating learning curve. To quantify the learning curve, we fit inverse power law models to the progressively sampled data. We fit such learning curves to four large clinical cancer genomic datasets, using three classifiers (diagonal linear discriminant analysis, K-nearest-neighbor with three neighbors, and support vector machines) and four values for the number of top genes included (5, 50, 500, 5,000). The inverse power law models fit the progressively sampled data reasonably well and showed considerable diversity when multiple classifiers were applied to the same data. Some classifiers showed rapid and continued increases in performance as the number of training samples increased, while others showed little if any improvement. Assessing classifier efficiency is particularly important in genomic studies, since samples are expensive to obtain. It is important to employ an algorithm that uses the predictive information efficiently, and with a modest number of training samples (>50), learning curves can be used to assess the predictive efficiency of classification algorithms.
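The abstract's approach can be sketched in a few lines: train a classifier on progressively larger subsamples, record performance at each training-set size, and fit an inverse power law to the resulting curve. The sketch below is illustrative only, assuming a common learning-curve form, accuracy(n) ≈ a − b·n^(−c); it uses synthetic accuracies rather than the paper's genomic data, and all variable names are hypothetical.

```python
# Illustrative sketch of fitting an inverse power law learning curve to
# progressively sampled performance data (synthetic, not the paper's data).
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, b, c):
    # a: asymptotic accuracy as n grows; b, c: scale and decay of the deficit.
    return a - b * n ** (-c)

rng = np.random.default_rng(0)
train_sizes = np.array([10, 20, 40, 80, 160, 320])

# Synthetic "observed" accuracies at each training-set size, with small noise
# standing in for cross-validated classifier performance.
accuracy = inverse_power_law(train_sizes, 0.90, 0.8, 0.7) \
    + rng.normal(0.0, 0.005, size=train_sizes.size)

# Fit the learning-curve model to the progressively sampled points.
params, _ = curve_fit(inverse_power_law, train_sizes, accuracy,
                      p0=[0.9, 1.0, 0.5], maxfev=10000)
a_hat, b_hat, c_hat = params
print(f"estimated asymptote a = {a_hat:.3f}, decay exponent c = {c_hat:.3f}")
```

The fitted asymptote `a_hat` estimates the performance the classifier would reach with unlimited training data, while the decay exponent `c_hat` indicates how efficiently it uses additional samples, which is the comparison the abstract draws between classifiers.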