Identification of single- and multiple-class specific signature genes from gene expression profiles by group marker index.

Author information

Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.

Publication information

PLoS One. 2011;6(9):e24259. doi: 10.1371/journal.pone.0024259. Epub 2011 Sep 1.

DOI:10.1371/journal.pone.0024259

PMID:21909426

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC3164723/

Abstract

Informative genes from microarray data can be used to construct prediction models and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, has low computational complexity, and efficiently identifies both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot easily identify multiple-class specific signature genes. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and uses a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small and when the class sizes differ significantly. To compare effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports the finding that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate pathway information with the multiple-class specific signature genes and cross-reference it to published studies. In the leukemia data, we find that the identified genes participate in pathways directly involved in cancer development. Our method offers a promising way to find genes that can be involved in pathways of multiple diseases and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases.
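To make the distinction between single- and multiple-class specific genes concrete, the following minimal Python sketch scores one gene against binary class templates. It is an illustration only: the abstract does not define the GMI, so this is not the GMI, and it is not necessarily the template-based baseline the authors formulated. The function names (`class_templates`, `best_template`), the use of Pearson correlation, and the toy data are all assumptions made for this example.

```python
from itertools import combinations

import numpy as np


def class_templates(labels, max_group_size=2):
    """Enumerate binary templates (hypothetical helper, not from the paper).

    Each template is 1 for samples whose class is in the chosen group
    and 0 elsewhere. Groups of size 1 model single-class specificity;
    larger groups model multiple-class specificity.
    """
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    templates = {}
    for k in range(1, max_group_size + 1):
        for group in combinations(classes, k):
            if len(group) == len(classes):
                continue  # an all-ones template carries no contrast
            templates[group] = np.isin(labels, group).astype(float)
    return templates


def best_template(expression, labels, max_group_size=2):
    """Return (group, correlation) for the template that best matches one gene.

    Assumption: Pearson correlation is used as the match score.
    """
    scores = {
        group: np.corrcoef(expression, template)[0, 1]
        for group, template in class_templates(labels, max_group_size).items()
    }
    group = max(scores, key=scores.get)
    return group, scores[group]


# Toy data (invented for illustration): three classes, two samples each.
labels = ["A", "A", "B", "B", "C", "C"]
gene_single = [9.0, 10.0, 1.0, 2.0, 1.5, 2.5]   # high only in class A
gene_multi = [9.0, 10.0, 9.5, 10.5, 1.0, 2.0]   # high in classes A and B

print(best_template(gene_single, labels))
print(best_template(gene_multi, labels))
```

In this toy setting the first gene matches the template for class A alone, while the second matches the template for classes A and B together. Because the score is a signed correlation, only groups with elevated expression are reported; a gene that is low in one class appears as high in the complementary group.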
