Das Samarendra, Rai Anil, Mishra D C, Rai Shesh N
Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India; Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
Gene. 2018 May 20;655:71-83. doi: 10.1016/j.gene.2018.02.044. Epub 2018 Feb 16.
Selection of informative genes from high-dimensional gene expression data has emerged as an important research area in genomics. Many of the gene selection techniques proposed so far are based on either a relevance or a redundancy measure. Further, the performance of these techniques has been judged through post-selection classification accuracy, computed with a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, Boot-MRMR, is proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biologically sufficient criteria: Gene Set Enrichment with QTL (GSEQ) and a biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique against 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes that are more biologically relevant. The proposed technique was also found to be competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that, under the multiple criteria decision-making setup, the proposed technique is the best of the available alternatives for informative gene selection. Based on the proposed approach, an R package, BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR. This study provides a practical guide for choosing statistical techniques to select informative genes from high-dimensional expression data for breeding and systems biology studies.
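The abstract does not spell out how the composite measure is computed, so the following is only a minimal sketch of the general idea (score each gene by relevance minus redundancy, then stabilise the ranking over bootstrap resamples). It is not the published Boot-MRMR algorithm and does not reproduce the BootMRMR R package. The function names, the use of absolute Pearson correlation for both relevance and redundancy, the difference form of the score, the stratified resampling, and the mean-rank aggregation are all assumptions made for illustration.

```python
import numpy as np

def mrmr_scores(X, y):
    """Hypothetical composite score per gene: relevance minus redundancy.

    X: array of shape (samples, genes); y: binary class labels (0/1).
    Relevance and redundancy are both absolute Pearson correlations here,
    which is an illustrative assumption, not the published measure.
    """
    n_genes = X.shape[1]
    # Correlate every gene with every other gene and with the label.
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    corr = np.nan_to_num(corr)          # constant columns give NaN; treat as 0
    relevance = corr[:n_genes, n_genes]  # |corr(gene, label)|
    gene_corr = corr[:n_genes, :n_genes].copy()
    np.fill_diagonal(gene_corr, 0.0)     # ignore self-correlation
    redundancy = gene_corr.sum(axis=1) / (n_genes - 1)
    return relevance - redundancy

def bootstrap_mean_ranks(X, y, n_boot=100, seed=0):
    """Average rank of each gene over bootstrap resamples (rank 1 = best)."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    class0 = np.flatnonzero(y == 0)
    class1 = np.flatnonzero(y == 1)
    rank_sum = np.zeros(n_genes)
    for _ in range(n_boot):
        # Resample subjects within each class so both classes stay present.
        rows = np.concatenate([
            rng.choice(class0, size=class0.size, replace=True),
            rng.choice(class1, size=class1.size, replace=True),
        ])
        scores = mrmr_scores(X[rows], y[rows])
        order = np.argsort(-scores)      # highest score first
        ranks = np.empty(n_genes)
        ranks[order] = np.arange(1, n_genes + 1)
        rank_sum += ranks
    return rank_sum / n_boot

# Synthetic demo: 40 subjects, 30 genes, only gene 0 differs between classes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 30))
X[:, 0] += 4.0 * y
mean_ranks = bootstrap_mean_ranks(X, y, n_boot=50)
print("Gene with the best mean rank:", int(np.argmin(mean_ranks)))
```

Averaging ranks over resamples is one simple way to make a selection less sensitive to the particular subjects in a small expression dataset; other aggregation rules or a formal test on the bootstrap ranks could be substituted.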