A simulation to analyze feature selection methods utilizing gene ontology for gene expression classification.

Affiliation

Dept. of Computer Science and Engineering, Oakland University, 2200 N Squirrel Rd, Rochester, MI 48309, United States.

Publication information

J Biomed Inform. 2013 Dec;46(6):1044-59. doi: 10.1016/j.jbi.2013.07.008. Epub 2013 Jul 25.

DOI:10.1016/j.jbi.2013.07.008

Abstract

Gene expression profile classification is a pivotal research domain that supports the transition from traditional to personalized medicine. A major challenge in classifying gene expression data is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms that reduce the number of genes. Recent studies have experimented with using the semantic similarity between genes in Gene Ontology (GO) to improve feature selection. While a few studies discuss how to use GO for feature selection, no simulation study addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation that generates binary-class datasets in which the genes differentially expressed between the two classes have some underlying relationship in GO. This allows us to investigate the effects of several factors: the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes (denoted by δ), and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor determining the efficacy of GO-based feature selection. In particular, as the connectedness of the differentially expressed genes increases, so does the improvement in classification accuracy. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level, BCAL(G), where G is a graph of differentially expressed genes.
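
The kind of binary-class dataset described above can be illustrated with a minimal sketch. This is not the authors' simulation: it assumes unit-variance Gaussian noise for every gene and a fixed shift of δ in one class for the differentially expressed genes, and it models neither the GO relationship among those genes nor RMA preprocessing. The names `simulate_binary_dataset` and `mean_separation` are hypothetical.

```python
import random


def simulate_binary_dataset(n_samples, n_genes, de_genes, delta, seed=0):
    """Return (X, y): X is a list of expression rows, y the 0/1 class labels.

    Assumed model (not from the paper): every gene is N(0, 1) noise, and
    the genes listed in de_genes are shifted by +delta in class 1 only.
    """
    rng = random.Random(seed)
    de = set(de_genes)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the two classes stay balanced
        row = [
            rng.gauss(delta if (label == 1 and g in de) else 0.0, 1.0)
            for g in range(n_genes)
        ]
        X.append(row)
        y.append(label)
    return X, y


def mean_separation(X, y, gene):
    """Difference of class means for one gene (class 1 minus class 0)."""
    c1 = [row[gene] for row, lab in zip(X, y) if lab == 1]
    c0 = [row[gene] for row, lab in zip(X, y) if lab == 0]
    return sum(c1) / len(c1) - sum(c0) / len(c0)
```

Calling `simulate_binary_dataset(200, 20, [0, 1, 2], 2.5)` yields 200 samples in which genes 0 to 2 separate the classes by roughly δ on average, while the remaining genes carry only noise.
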
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) when δ ≥ 0.7 and the number of genes in a biological condition increases beyond 50, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), with δ between 0.3 and 2.5 and training sample sizes between 20 and 200; our conclusions are therefore limited to these settings. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical ranking measures.
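
The five conclusions can be read as a small decision rule. The sketch below encodes only the thresholds stated in the abstract. The function names, the order in which the rules are checked, and the handling of the one case the abstract leaves open (0.389 < BCAL(G) < 0.696 with δ ≥ 1) are assumptions made for illustration, and computing BCAL(G) itself is outside its scope.

```python
def within_studied_range(delta, n_train):
    """True if delta and the training size fall inside the simulated ranges."""
    return 0.3 <= delta <= 2.5 and 20 <= n_train <= 200


def go_feature_selection_advice(bcal, delta, n_genes):
    """Map conclusions (1)-(5) to a (verdict, diminishing) pair.

    The rule ordering is an assumption of this sketch, not the paper's.
    """
    # Conclusion (4): the gain shrinks for conditions with more than
    # 50 genes when delta >= 0.7.
    diminishing = n_genes > 50 and delta >= 0.7
    if n_genes < 10:  # conclusion (5)
        return ("not recommended", diminishing)
    if bcal >= 0.696:  # conclusion (1)
        return ("increases accuracy", diminishing)
    if bcal <= 0.389:  # conclusion (2)
        return ("decreases accuracy", diminishing)
    if delta < 1:  # conclusion (3): 0.389 < BCAL(G) < 0.696
        return ("marginal improvement", diminishing)
    # Intermediate BCAL(G) with delta >= 1 is not covered by the abstract.
    return ("not stated", diminishing)
```

For example, `go_feature_selection_advice(0.8, 1.0, 60)` reports an accuracy increase together with the diminishing-returns flag, and `within_studied_range` can be checked first to confirm that δ and the training size lie inside the ranges the simulation covered.
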

Similar articles

SoFoCles: feature filtering for microarray classification based on gene ontology.

J Biomed Inform. 2010 Feb;43(1):1-14. doi: 10.1016/j.jbi.2009.06.002. Epub 2009 Jul 1.

An efficient statistical feature selection approach for classification of gene expression data.

J Biomed Inform. 2011 Aug;44(4):529-35. doi: 10.1016/j.jbi.2011.01.001. Epub 2011 Jan 15.

Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data.

Artif Intell Med. 2011 Sep;53(1):47-56. doi: 10.1016/j.artmed.2011.06.008. Epub 2011 Jul 19.

Improved binary PSO for feature selection using gene expression data.

Comput Biol Chem. 2008 Feb;32(1):29-37. doi: 10.1016/j.compbiolchem.2007.09.005. Epub 2007 Sep 25.

A relation based measure of semantic similarity for Gene Ontology annotations.

BMC Bioinformatics. 2008 Nov 4;9:468. doi: 10.1186/1471-2105-9-468.

Tumor classification ranking from microarray data.

BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S21. doi: 10.1186/1471-2164-9-S2-S21.

A hybrid feature selection method for DNA microarray data.

Comput Biol Med. 2011 Apr;41(4):228-37. doi: 10.1016/j.compbiomed.2011.02.004. Epub 2011 Mar 3.

Derivation of an artificial gene to improve classification accuracy upon gene selection.

Comput Biol Chem. 2012 Feb;36:1-12. doi: 10.1016/j.compbiolchem.2011.11.002. Epub 2011 Nov 28.

ADGO: analysis of differentially expressed gene sets using composite GO annotation.

Bioinformatics. 2006 Sep 15;22(18):2249-53. doi: 10.1093/bioinformatics/btl378. Epub 2006 Jul 12.

Cited by

Graph-based semi-supervised learning with genomic data integration using condition-responsive genes applied to phenotype classification.

J Am Med Inform Assoc. 2018 Jan 1;25(1):99-108. doi: 10.1093/jamia/ocx032.

Confident gene activity prediction based on single histone modification H2BK5ac in human cell lines.

BMC Bioinformatics. 2017 Jan 25;18(1):67. doi: 10.1186/s12859-016-1418-6.

PMID:23892294