Mallik Saurav, Bhadra Tapas, Maulik Ujjwal
IEEE Trans Nanobioscience. 2017 Jan;16(1):3-10. doi: 10.1109/TNB.2017.2650217. Epub 2017 Jan 9.
Epigenetic biomarker discovery is an important task in bioinformatics. In this article, we develop a new framework for identifying statistically significant epigenetic biomarkers using maximal-relevance and minimal-redundancy criterion based feature (gene) selection on a multi-omics dataset. First, we determine the genes that have both expression and methylation values and follow a normal distribution. Similarly, we identify the genes that have both expression and methylation values but do not follow a normal distribution. For each case, we apply a gene-selection method that returns maximally relevant, but variable-weighted minimally redundant, genes as the top-ranked genes. For statistical validation, we apply a t-test to both the expression and methylation data restricted to the normally distributed top-ranked genes to determine how many of them are both differentially expressed and differentially methylated. Similarly, we use the Limma package to perform a non-parametric empirical Bayes test on both the expression and methylation data restricted to the non-normally distributed top-ranked genes to identify how many of them are both differentially expressed and differentially methylated. Finally, we report the top-ranking significant gene markers along with their biological validation. Moreover, our framework improves the positive predictive rate and reduces the false positive rate in marker identification. In addition, we provide a comparative analysis of our gene-selection method against other methods, based on the classification performance obtained using several well-known classifiers.
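The gene-selection step can be illustrated with a minimal sketch of greedy maximal-relevance, minimal-redundancy ranking. This is not the authors' implementation: the abstract does not state which relevance or redundancy measure is used, nor how the redundancy weight varies. The sketch assumes absolute Pearson correlation with a binary class label as relevance, mean absolute Pearson correlation with already-selected genes as redundancy, and a single fixed `weight` standing in for the variable weighting. The function names, gene names, and toy values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mrmr_rank(profiles, labels, k, weight=1.0):
    """Greedily pick k genes, trading relevance against redundancy.

    profiles: dict mapping gene name -> list of values, one per sample
    labels:   list of 0/1 class labels, one per sample
    weight:   assumed fixed redundancy weight (the paper's weighting scheme
              is not described in the abstract)
    """
    # Relevance: how strongly each gene tracks the class label.
    relevance = {g: abs(pearson(v, labels)) for g, v in profiles.items()}
    selected = []
    remaining = set(profiles)

    def score(gene):
        if not selected:
            return relevance[gene]
        # Redundancy: mean absolute correlation with genes already chosen.
        redundancy = sum(
            abs(pearson(profiles[gene], profiles[s])) for s in selected
        ) / len(selected)
        return relevance[gene] - weight * redundancy

    while remaining and len(selected) < k:
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: three hypothetical genes measured on six samples (3 per class).
labels = [0, 0, 0, 1, 1, 1]
profiles = {
    "geneA": [1, 2, 1, 5, 6, 5],  # strongly class-related
    "geneB": [1, 2, 2, 5, 6, 4],  # class-related, but nearly a copy of geneA
    "geneC": [3, 1, 4, 4, 3, 6],  # weaker signal, less redundant with geneA
}

print(mrmr_rank(profiles, labels, k=2))              # ['geneA', 'geneC']
print(mrmr_rank(profiles, labels, k=2, weight=0.0))  # ['geneA', 'geneB']
```

With the redundancy penalty switched off (`weight=0.0`), the ranking reduces to pure relevance and selects the near-duplicate `geneB`; with the penalty on, the less redundant `geneC` is preferred. In the framework described above, a ranking of this kind would be run separately on the normally and non-normally distributed gene sets before the statistical tests are applied.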