Effect of training-sample size and classification difficulty on the accuracy of genomic predictors.

Affiliation

Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland.

Publication information

Breast Cancer Res. 2010;12(1):R5. doi: 10.1186/bcr2468. Epub 2010 Jan 11.

DOI:10.1186/bcr2468

PMID:20064235

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2880423/

Abstract

INTRODUCTION

As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.

METHODS

We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
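The model counts above follow from a simple cross product. The sketch below enumerates that grid; the method, classifier, and endpoint names are hypothetical placeholders, since the abstract gives only the counts (five selectors, eight classifiers, three endpoints).

```python
from itertools import product

# Placeholder names only: the abstract states the counts, not the methods.
feature_selectors = [f"selector_{i}" for i in range(1, 6)]  # 5 univariate methods
classifiers = [f"classifier_{j}" for j in range(1, 9)]      # 8 classifiers
endpoints = [f"endpoint_{k}" for k in range(1, 4)]          # 3 clinical endpoints

# One predictor = one feature-selection method paired with one classifier.
predictors = list(product(feature_selectors, classifiers))

# One model = one predictor applied to one endpoint.
models = list(product(endpoints, feature_selectors, classifiers))

print(len(predictors))  # 40
print(len(models))      # 120
```

The 120 models are therefore the same 40 predictors evaluated once per endpoint.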

RESULTS

A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
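To illustrate the two kinds of resampling estimate compared here, the following minimal sketch computes a k-fold cross-validation accuracy and a bootstrap out-of-bag accuracy on the same training data. Everything in it is an assumption made for illustration: the synthetic one-feature data, the nearest-centroid classifier, the fold count, and the number of bootstrap rounds. The abstract does not state which settings the study used, and this is not the authors' procedure.

```python
import random
from statistics import mean

def fit_centroids(xs, ys):
    """Return the mean feature value per class (a 1-D nearest-centroid model)."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def accuracy(train_idx, test_idx, xs, ys):
    """Fit on the training indices and return accuracy on the test indices."""
    model = fit_centroids([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
    return hits / len(test_idx)

def cross_validation(xs, ys, k, rng):
    """Mean accuracy over k folds; each sample is held out exactly once."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    scores = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        scores.append(accuracy(train, folds[f], xs, ys))
    return mean(scores)

def bootstrap_oob(xs, ys, n_rounds, rng):
    """Mean accuracy on out-of-bag samples over bootstrap resamples."""
    n = len(xs)
    scores = []
    for _ in range(n_rounds):
        train = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(train)
        test = [i for i in range(n) if i not in in_bag]
        if test:  # skip the rare resample that leaves nothing out of bag
            scores.append(accuracy(train, test, xs, ys))
    return mean(scores)

# Synthetic training set (hypothetical): two balanced classes, one feature.
rng = random.Random(0)
ys = [i % 2 for i in range(60)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]

cv_acc = cross_validation(xs, ys, k=5, rng=rng)
boot_acc = bootstrap_oob(xs, ys, n_rounds=100, rng=rng)
print(f"5-fold CV accuracy: {cv_acc:.3f}")
print(f"bootstrap out-of-bag accuracy: {boot_acc:.3f}")
```

In a study like this one, each estimate would then be compared with the accuracy measured on the independent validation set to see which resampling method tracks it more closely.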

CONCLUSIONS

We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c1f9/2880423/69e6f4d02531/bcr2468-1.jpg
