Learning curves in classification with microarray data.

Affiliation

Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, 77030, USA.

Publication information

Semin Oncol. 2010 Feb;37(1):65-8. doi: 10.1053/j.seminoncol.2009.12.002.

Abstract

The performance of many repeated tasks improves with experience and practice. This improvement tends to be rapid initially and then slows. The term "learning curve" is often used to describe this phenomenon. In supervised machine learning, the performance of classification algorithms often increases with the number of observations used to train the algorithm. We use progressively larger samples of observations to train the algorithm and then plot performance against the number of training observations. This yields the familiar negatively accelerating learning curve. To quantify the learning curve, we fit inverse power law models to the progressively sampled data. We fit such learning curves to four large clinical cancer genomic datasets, using three classifiers (diagonal linear discriminant analysis, K-nearest-neighbor with three neighbors, and support vector machines) and four values for the number of top genes included (5, 50, 500, and 5,000). The inverse power law models fit the progressively sampled data reasonably well and showed considerable diversity when multiple classifiers were applied to the same data. Some classifiers showed rapid and continued increase in performance as the number of training samples increased, while others showed little if any improvement. Assessing classifier efficiency is particularly important in genomic studies because samples are so expensive to obtain. It is therefore important to employ an algorithm that uses the predictive information efficiently; with a modest number of training samples (>50), learning curves can be used to assess the predictive efficiency of classification algorithms.
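
The curve-fitting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact parameterization or the fitting procedure, so the three-parameter form err(n) = a + b * n^(-c), the grid search over the exponent, the function names, and the synthetic data points below are all assumptions made for illustration; they are not the authors' method or data.

```python
# Sketch: fit an inverse power law err(n) = a + b * n**(-c) to learning-curve points.
# The model form and the grid-search fitting procedure are assumptions for
# illustration; the abstract specifies neither.

def fit_inverse_power_law(sizes, errors, c_grid=None):
    """Return (a, b, c, sse) minimizing squared error over a grid of exponents c."""
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 301)]  # candidate exponents 0.01 .. 3.00
    best = None
    m = len(sizes)
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        # With c held fixed, errors ~ a + b * x is ordinary least squares in (a, b).
        mean_x = sum(x) / m
        mean_y = sum(errors) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, errors))
        b = sxy / sxx
        a = mean_y - b * mean_x
        sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best


def predict_error(n, a, b, c):
    """Error rate the fitted curve predicts for n training samples."""
    return a + b * n ** (-c)


# Synthetic points (not data from the paper), generated from a=0.10, b=1.0, c=0.5.
train_sizes = [10, 20, 40, 80, 160]
test_errors = [0.10 + 1.0 * n ** -0.5 for n in train_sizes]

a, b, c, sse = fit_inverse_power_law(train_sizes, test_errors)
print(f"asymptotic error a={a:.3f}, scale b={b:.3f}, decay rate c={c:.2f}")
print(f"extrapolated error at n=320: {predict_error(320, a, b, c):.3f}")
```

Under this assumed form, a is the error rate the classifier approaches with unlimited training data and c is how quickly it gets there, so comparing fitted (a, c) pairs across classifiers is one way to read the "predictive efficiency" the abstract refers to. In practice each point would be an average test error over repeated random subsamples of a given size rather than a single noiseless value.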

Similar articles

MLSeq: Machine learning interface for RNA-sequencing data.
Comput Methods Programs Biomed. 2019 Jul;175:223-231. doi: 10.1016/j.cmpb.2019.04.007. Epub 2019 Apr 29.

Predicting sample size required for classification performance.
BMC Med Inform Decis Mak. 2012 Feb 15;12:8. doi: 10.1186/1472-6947-12-8.
