Feature selection and classifier performance on diverse biological datasets.

Author information

Hemphill Edward, Lindsay James, Lee Chih, Măndoiu Ion I, Nelson Craig E

Publication information

BMC Bioinformatics. 2014;15 Suppl 13(Suppl 13):S4. doi: 10.1186/1471-2105-15-S13-S4. Epub 2014 Nov 13.

DOI: 10.1186/1471-2105-15-S13-S4

PMID: 25434802

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4248652/

Abstract

BACKGROUND

There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, and identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore how the models perform across multiple biological data types.

RESULTS

This paper presents the results of a large-scale empirical study in which a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize the predictive accuracy of all models across all parameter settings on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data to demonstrate robustness.

CONCLUSIONS

As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genomic hybridization (aCGH), and microRNA data.
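
The kind of evaluation described in the Results (select a fixed number of markers, train a classifier, and measure accuracy as the number of markers varies) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the specific algorithms, so the univariate score, the nearest-centroid classifier, the synthetic three-class data, and all function names are hypothetical stand-ins rather than the authors' pipeline. The one point the sketch is meant to show is that feature selection is repeated inside each cross-validation fold, using training samples only.

```python
import random
from statistics import mean


def make_synthetic(n_per_class=20, n_features=50, n_informative=5, seed=0):
    """Toy stand-in for a marker matrix: 3 classes, a few informative features."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += 3.0 * label  # class-dependent shift (assumed effect size)
            X.append(row)
            y.append(label)
    return X, y


def f_score(values, labels):
    """Between-class over within-class sum of squares for a single feature."""
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(label, []).append(v)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = 0.0
    for g in groups.values():
        m = mean(g)
        within += sum((v - m) ** 2 for v in g)
    return between / (within + 1e-12)


def select_top_k(X, y, k):
    """Indices of the k features with the highest univariate score."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, X_test, features):
    """Assign each test row to the class whose centroid is closest."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    preds = []
    for x in X_test:
        point = [x[j] for j in features]
        preds.append(min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        ))
    return preds


def cross_validated_accuracy(X, y, k, n_folds=5, seed=0):
    """k-marker accuracy, with selection redone on each training fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        feats = select_top_k(X_tr, y_tr, k)  # never sees the held-out samples
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in fold], feats)
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


X, y = make_synthetic()
for k in (1, 5, 20):
    print(k, round(cross_validated_accuracy(X, y, k), 3))
```

Running the loop for several values of k on each data type would give the accuracy-versus-marker-count curves that the Conclusions compare. Selecting features on the full data set before splitting would leak information from the held-out samples and inflate the estimates.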


