Dept. of Computer Science and Engineering, Oakland University, 2200 N Squirrel Rd, Rochester, MI 48309, United States.
J Biomed Inform. 2013 Dec;46(6):1044-59. doi: 10.1016/j.jbi.2013.07.008. Epub 2013 Jul 25.
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have experimented with using the semantic similarity between genes in Gene Ontology (GO) to improve feature selection. While a few studies discuss how to use GO for feature selection, no simulation study addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation that generates binary-class datasets in which the differentially expressed genes between the two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors, such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes (denoted by δ), and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor determining the efficacy of GO-based feature selection. In particular, as the connectedness of the differentially expressed genes increases, the improvement in classification accuracy increases. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level, BCAL(G), where G is a graph of differentially expressed genes.
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ ≥ 0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200; our conclusions are therefore limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistics-based ranking measures.
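The five conclusions above can be read as a small decision rule. The following Python sketch is only an illustration of that reading, not code from the study: the function name `go_selection_advice`, its parameters, the returned labels, and the order in which the rules are checked are all assumptions. In particular, applying conclusion (4) only inside the high-BCAL case is a guess about how the conclusions combine, and the sketch takes a precomputed BCAL(G) value as input because the abstract does not give its formula.

```python
def go_selection_advice(bcal: float, delta: float, n_genes: int, n_train: int) -> str:
    """Hypothetical helper: map the abstract's stated thresholds to a label.

    bcal    -- precomputed BCAL(G) value (its formula is not given here)
    delta   -- mean separation of differentially expressed genes
    n_genes -- number of genes in the biological condition
    n_train -- number of training samples
    """
    # The conclusions are stated only for 0.3 <= delta <= 2.5
    # and 20 <= training samples <= 200 (RMA-preprocessed data).
    if not (0.3 <= delta <= 2.5) or not (20 <= n_train <= 200):
        return "outside studied range"

    # Conclusion (5): fewer than ten genes.
    if n_genes < 10:
        return "not recommended"

    # Conclusion (2): weakly connected conditions.
    if bcal <= 0.389:
        return "likely decreases accuracy"

    # Conclusion (1), with (4) applied as a caveat (assumed combination).
    if bcal >= 0.696:
        if n_genes > 50 and delta >= 0.7:
            return "likely improves accuracy, with reduced gain"
        return "likely improves accuracy"

    # Conclusion (3): intermediate BCAL and small separation.
    if delta < 1:
        return "marginal improvement"

    # Intermediate BCAL with delta >= 1 is not covered by the conclusions.
    return "not addressed"


if __name__ == "__main__":
    print(go_selection_advice(bcal=0.75, delta=0.5, n_genes=30, n_train=100))
```

The final branch returns "not addressed" rather than guessing, since the abstract states no conclusion for intermediate BCAL(G) values when δ ≥ 1.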