Department of Chemical and Biological Engineering, Iowa State University, Ames, IA 50011, USA.
BioData Min. 2010 Dec 17;3(1):11. doi: 10.1186/1756-0381-3-11.
Microarray data sets provide relative expression levels for thousands of genes across a comparatively small number of experimental conditions, called assays. Data mining techniques are used to extract specific information about genes as they relate to the assays. The multivariate statistical technique of principal component analysis (PCA) has proven useful in providing effective data mining methods. This article extends the PCA approach of Rollins et al. to rank the genes of microarray data sets that are expressed most differently between two biologically different groupings of assays. The method is evaluated on real and simulated data and compared to a current approach on the basis of false discovery rate (FDR) and statistical power (SP), the ability to correctly identify important genes.
This work developed and evaluated two new test statistics based on PCA and compared them to a popular method that is not PCA-based. Both test statistics were found to be effective in three case studies: (i) exposure of E. coli cells to two different ethanol levels; (ii) application of myostatin to two groups of mice; and (iii) a simulated data study derived from the properties of (ii). In comparison with the current method (CM), the proposed method (PM) effectively identified critical genes in these studies. The simulation study supports higher identification accuracy for PM over CM for both proposed test statistics when the gene variance is constant, and for one of the test statistics when the gene variance is non-constant.
PM compares quite favorably to CM, with lower FDR and much higher SP. PM can therefore be quite effective in producing accurate signatures from large microarray data sets for differential expression between assay groups identified in a preliminary step of the PCA procedure, and it is recommended for use in these applications.
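As an illustration of the two comparison criteria, the following minimal sketch computes the observed false discovery proportion and the statistical power for a list of selected genes when the truly differential genes are known, as they would be in a simulation study. The function name, gene labels, and selection are hypothetical and are not taken from the article. The sketch also reports the observed proportion for a single selection, whereas FDR is formally the expected value of that proportion.

```python
def fdr_and_power(selected, truly_differential):
    """Return (observed false discovery proportion, power) for one selection.

    selected: genes flagged as differentially expressed by some method.
    truly_differential: genes known to differ (available only in simulation).
    """
    selected = set(selected)
    truth = set(truly_differential)
    true_pos = len(selected & truth)   # important genes correctly flagged
    false_pos = len(selected - truth)  # flagged genes that are not important
    # Guard against empty sets so the ratios are always defined.
    fdr = false_pos / len(selected) if selected else 0.0
    power = true_pos / len(truth) if truth else 0.0
    return fdr, power


# Hypothetical simulated study: four genes are truly differential.
truth = {"g0", "g1", "g2", "g3"}
# Hypothetical top-4 genes from some ranking method.
selected = ["g0", "g1", "g2", "g7"]

fdr, power = fdr_and_power(selected, truth)
print(fdr, power)  # 0.25 0.75
```

A method with lower FDR and higher SP than its competitor, as reported for PM relative to CM, would flag fewer unimportant genes while recovering more of the truly differential ones.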