Bontempi Gianluca
Département d'Informatique, Université Libre de Bruxelles, Bruxelles, Belgium.
IEEE/ACM Trans Comput Biol Bioinform. 2007 Apr-Jun;4(2):293-300. doi: 10.1109/TCBB.2007.1014.
Because of high dimensionality, machine learning algorithms typically rely on feature selection techniques in order to perform effective classification in microarray gene expression data sets. However, the large number of features compared to the number of samples makes the task of feature selection computationally hard and prone to errors. This paper interprets feature selection as a task of stochastic optimization, where the goal is to select, among an exponential number of alternative gene subsets, the one expected to return the highest generalization accuracy in classification. Blocking is an experimental design strategy which produces similar experimental conditions to compare alternative stochastic configurations in order to be confident that observed differences in accuracy are due to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a single learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the number of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently come up with a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection on a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that the improvements take place independently of the classification algorithm used after the selection step.
Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation is based on retrieving PubMed abstracts associated with the selected genes and matching them to regular expressions describing the biological phenomenon underlying the expression data sets. The biological validation that follows is based on the use of the Bioconductor package GOstats in order to perform Gene Ontology statistical analysis.
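The core idea of the abstract, assessing each candidate gene subset with several learners under identical (paired) validation conditions and aggregating their outcomes before comparing subsets, can be illustrated with a minimal sketch. This is not the paper's implementation: the two toy learners (nearest centroid and 1-NN), the leave-one-out scheme, the synthetic data, and all function names are illustrative assumptions; the paper uses six classifiers and 16 real cancer data sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expression" matrix: 30 samples, 20 genes.
# Genes 0 and 1 are informative (shifted class means); the rest are noise.
n_per_class, n_genes = 15, 20
y = np.array([0] * n_per_class + [1] * n_per_class)
X = rng.normal(0.0, 1.0, size=(2 * n_per_class, n_genes))
X[y == 1, 0] += 3.0
X[y == 1, 1] += 3.0


def nearest_centroid_errors(Xs, y):
    """Leave-one-out 0/1 error vector for a nearest-centroid classifier."""
    errs = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = Xs[mask], y[mask]
        centroids = [Xtr[ytr == c].mean(axis=0) for c in (0, 1)]
        pred = int(np.linalg.norm(Xs[i] - centroids[1])
                   < np.linalg.norm(Xs[i] - centroids[0]))
        errs[i] = pred != y[i]
    return errs


def one_nn_errors(Xs, y):
    """Leave-one-out 0/1 error vector for a 1-nearest-neighbour classifier."""
    errs = np.empty(len(y))
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf  # exclude the held-out sample itself
        errs[i] = y[np.argmin(d)] != y[i]
    return errs


LEARNERS = [nearest_centroid_errors, one_nn_errors]


def blocked_score(subset):
    """Aggregate the paired LOO error outcomes over all learner blocks.

    Every candidate subset is evaluated on the same folds with the same
    learners, so comparisons between subsets are paired by construction.
    """
    Xs = X[:, list(subset)]
    return np.concatenate([learner(Xs, y) for learner in LEARNERS]).mean()


def forward_selection(n_select):
    """Greedy forward selection driven by the blocked multi-learner score."""
    selected, remaining = [], set(range(n_genes))
    for _ in range(n_select):
        best = min(remaining, key=lambda g: blocked_score(selected + [g]))
        selected.append(best)
        remaining.remove(best)
    return selected


subset = forward_selection(2)
```

With the strong synthetic signal, the blocked forward selection recovers an informative gene first; averaging the paired error vectors of both learners is what stands in here for the paper's aggregation of several validation outcomes.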