Shaik Jahangheer S, Yeasin Mohammed
Department of Electrical and Computer Engineering, CVPIA Lab, University of Memphis, Memphis, TN-38152, USA.
BMC Bioinformatics. 2007 Sep 18;8:347. doi: 10.1186/1471-2105-8-347.
This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes, and (iii) validation. The first module ranks the genes using two gene selection algorithms: (a) two-way clustering and (b) combined adaptive ranking. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are then selected using false discovery rate (FDR) analysis. The third module performs a three-fold validation of the obtained DEGs. First, the robustness of the unified framework in gene selection is illustrated using FDR analysis. Second, a clustering-based validation of the DEGs is performed by applying an adaptive subspace-based clustering algorithm to the training and test datasets. Finally, a projection-based visualization is used to validate the DEGs obtained with the unified framework.
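The fusion and selection steps of the second module can be illustrated with a minimal sketch. It assumes that each gene already has two p-values (one per ranking method), and it does not reproduce the R-test. Fisher's omnibus statistic is X = -2(ln p1 + ln p2), which follows a chi-square distribution with 4 degrees of freedom under the null hypothesis when the two p-values are independent. The Benjamini-Hochberg procedure is used here as a stand-in for the FDR analysis, since the abstract does not say which FDR procedure was applied. The function names are hypothetical.

```python
import math


def fisher_combine_two(p1, p2):
    """Fuse two p-values with Fisher's method.

    X = -2 * (ln p1 + ln p2) is chi-square with 4 degrees of freedom
    under the null. For 4 degrees of freedom the survival function is
    exp(-x/2) * (1 + x/2), which simplifies to prod * (1 - ln(prod)).
    """
    prod = p1 * p2
    if prod <= 0.0:
        return 0.0  # a zero p-value gives a zero combined p-value
    return prod * (1.0 - math.log(prod))


def benjamini_hochberg(pvals, alpha=0.05):
    """Return the indices of genes selected at FDR level alpha.

    Assumed stand-in for the FDR analysis: the Benjamini-Hochberg
    step-up rule, which finds the largest rank k such that
    p_(k) <= alpha * k / m and selects the k smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values from the two ranking methods, one pair per gene.
p_two_way = [0.001, 0.20, 0.04, 0.60]
p_adaptive = [0.002, 0.30, 0.01, 0.50]

fused = [fisher_combine_two(a, b) for a, b in zip(p_two_way, p_adaptive)]
selected = benjamini_hochberg(fused, alpha=0.05)
```

Because both ranking methods operate on the same expression data, their p-values are unlikely to be fully independent, so the fused values in such a sketch should be read as approximate.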
The performance of the unified framework is compared with that of well-known ranking algorithms: t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking, and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets, each following two different distributions, indicate that the unified framework outperforms the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show a similar improvement in performance. A three-fold validation process is provided for the two-sample cancer datasets. In addition, the analysis of the 3 Parkinson's datasets demonstrates the scalability of the proposed method to multi-sample microarray datasets.
This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two different ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of the p-values, and the FDR analysis aid in identifying significant genes, which cannot be judged from gene ranking alone. The three-fold validation (robustness of gene selection under FDR analysis, clustering, and visualization) demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate its utility on microarray datasets with two classes of samples. Its scalability to multi-sample microarray datasets (more than two sample classes) is addressed using 3 Parkinson's datasets. In these analyses, the unified framework outperformed the other gene selection methods in selecting differentially expressed genes from microarray data.