
Learning curves in classification with microarray data.

Affiliation

Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, 77030, USA.

Publication information

Semin Oncol. 2010 Feb;37(1):65-8. doi: 10.1053/j.seminoncol.2009.12.002.

DOI:10.1053/j.seminoncol.2009.12.002
PMID:20172367
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4482113/
Abstract

The performance of many repeated tasks improves with experience and practice. This improvement tends to be rapid initially and then decreases. The term "learning curve" is often used to describe the phenomenon. In supervised machine learning, the performance of classification algorithms often increases with the number of observations used to train the algorithm. We use progressively larger samples of observations to train the algorithm and then plot performance against the number of training observations. This yields the familiar negatively accelerating learning curve. To quantify the learning curve, we fit inverse power law models to the progressively sampled data. We fit such learning curves to four large clinical cancer genomic datasets, using three classifiers (diagonal linear discriminant analysis, K-nearest-neighbor with three neighbors, and support vector machines) and four values for the number of top genes included (5, 50, 500, 5,000). The inverse power law models fit the progressively sampled data reasonably well and showed considerable diversity when multiple classifiers are applied to the same data. Some classifiers showed rapid and continued increase in performance as the number of training samples increased, while others showed little if any improvement. Assessing classifier efficiency is particularly important in genomic studies since samples are so expensive to obtain. It is important to employ an algorithm that uses the predictive information efficiently, but with a modest number of training samples (>50), learning curves can be used to assess the predictive efficiency of classification algorithms.
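The curve-fitting step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common three-parameter inverse power law form err(n) = a + b * n^(-c), where n is the number of training samples, a is the asymptotic error rate, b is the learning rate, and c is the decay rate. The abstract does not state the exact parameterization or fitting method, so the function name `fit_inverse_power_law`, the exponent grid, and the synthetic data are all assumptions made for the example. For a fixed exponent c the model is linear in a and b, so the sketch scans a grid of c values and solves an ordinary least-squares problem at each one.

```python
def fit_inverse_power_law(sizes, errors, exponents=None):
    """Fit err(n) ~ a + b * n**(-c) by profiling over the exponent c.

    For each candidate c the model is linear in (a, b), so those two
    parameters come from simple linear regression of the error rates on
    n**(-c). The c with the smallest residual sum of squares is kept.
    Returns the tuple (a, b, c, sse).
    """
    if exponents is None:
        # Hypothetical search grid: c = 0.01, 0.02, ..., 3.00
        exponents = [k / 100 for k in range(1, 301)]
    m = len(sizes)
    y_mean = sum(errors) / m
    best = None
    for c in exponents:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0:
            continue  # degenerate design, e.g. all sizes identical
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, errors))
        b = sxy / sxx
        a = y_mean - b * x_mean
        sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best


# Synthetic, noise-free error rates (made up for illustration, not from the paper).
sizes = [10, 20, 40, 80, 160, 320]
errors = [0.1 + 1.0 * n ** -0.5 for n in sizes]

a, b, c, sse = fit_inverse_power_law(sizes, errors)
print(f"asymptotic error a={a:.3f}, learning rate b={b:.3f}, decay rate c={c:.2f}")
```

In practice the error rates would come from training a classifier on progressively larger random subsets and averaging the held-out error at each size. A fitted curve that is nearly flat suggests the classifier gains little from additional samples, while a steep curve suggests that more samples would still help.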


Similar articles

1
Learning curves in classification with microarray data.
Semin Oncol. 2010 Feb;37(1):65-8. doi: 10.1053/j.seminoncol.2009.12.002.
2
Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data.
Theor Biol Med Model. 2014 May 7;11 Suppl 1(Suppl 1):S7. doi: 10.1186/1742-4682-11-S1-S7.
3
Comparison of feature selection and classification for MALDI-MS data.
BMC Genomics. 2009 Jul 7;10 Suppl 1(Suppl 1):S3. doi: 10.1186/1471-2164-10-S1-S3.
4
Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification.
PLoS One. 2015 Mar 30;10(3):e0120364. doi: 10.1371/journal.pone.0120364. eCollection 2015.
5
Feature selection and classification of MAQC-II breast cancer and multiple myeloma microarray gene expression data.
PLoS One. 2009 Dec 11;4(12):e8250. doi: 10.1371/journal.pone.0008250.
6
MLSeq: Machine learning interface for RNA-sequencing data.
Comput Methods Programs Biomed. 2019 Jul;175:223-231. doi: 10.1016/j.cmpb.2019.04.007. Epub 2019 Apr 29.
7
Training based on ligand efficiency improves prediction of bioactivities of ligands and drug target proteins in a machine learning approach.
J Chem Inf Model. 2013 Oct 28;53(10):2525-37. doi: 10.1021/ci400240u. Epub 2013 Sep 24.
8
Regularized Least Squares Cancer classifiers from DNA microarray data.
BMC Bioinformatics. 2005 Dec 1;6 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-6-S4-S2.
9
Predicting sample size required for classification performance.
BMC Med Inform Decis Mak. 2012 Feb 15;12:8. doi: 10.1186/1472-6947-12-8.
10
A review of image analysis and machine learning techniques for automated cervical cancer screening from pap-smear images.
Comput Methods Programs Biomed. 2018 Oct;164:15-22. doi: 10.1016/j.cmpb.2018.05.034. Epub 2018 Jun 26.

Cited by

1
Performance reserves in brain-imaging-based phenotype prediction.
Cell Rep. 2024 Jan 23;43(1):113597. doi: 10.1016/j.celrep.2023.113597. Epub 2023 Dec 29.
2
Radio-pathomic Maps of Epithelium and Lumen Density Predict the Location of High-Grade Prostate Cancer.
Int J Radiat Oncol Biol Phys. 2018 Aug 1;101(5):1179-1187. doi: 10.1016/j.ijrobp.2018.04.044. Epub 2018 Apr 24.
3
Predicting sample size required for classification performance.
BMC Med Inform Decis Mak. 2012 Feb 15;12:8. doi: 10.1186/1472-6947-12-8.
4
Addressing the challenge of defining valid proteomic biomarkers and classifiers.
BMC Bioinformatics. 2010 Dec 10;11:594. doi: 10.1186/1471-2105-11-594.

References

1
Estimating dataset size requirements for classifying DNA microarray data.
J Comput Biol. 2003;10(2):119-42. doi: 10.1089/106652703321825928.
2
Statistical assessment of the learning curves of health technologies.
Health Technol Assess. 2001;5(12):1-79. doi: 10.3310/hta5120.