Lu Yan, Liu Peng-Yuan, Xiao Peng, Deng Hong-Wen
Osteoporosis Research Center, Creighton University, 601 N. 30th Street, Suite 6787, Omaha, NE 68131, USA.
Bioinformatics. 2005 Jul 15;21(14):3105-13. doi: 10.1093/bioinformatics/bti496. Epub 2005 May 19.
The most widely used statistical methods for finding differentially expressed genes (DEGs) are essentially univariate. In this study, we present a new T² statistic for analyzing microarray data. We implemented our method using a multiple forward search (MFS) algorithm that is designed for selecting a subset of feature vectors in high-dimensional microarray datasets. The proposed T² statistic is a corollary of the statistic originally developed for multivariate analyses and possesses two prominent statistical properties. First, our method takes into account the multidimensional structure of microarray data. Using the information hidden in gene interactions allows for finding genes whose differential expression is not marginally detectable by univariate testing methods. Second, the statistic has a close relationship to discriminant analyses for classification of gene expression patterns. Our search algorithm sequentially maximizes the gene expression difference/distance between two groups of genes. Including such a set of DEGs in the initial feature variables may increase the power of classification rules. We validated our method using a spike-in HGU95 dataset from Affymetrix. The utility of the new method was demonstrated by applying it to the analysis of gene expression patterns in human liver cancers and breast cancers. Extensive bioinformatics analyses and cross-validation of the DEGs identified in the application datasets showed significant advantages of our new algorithm.
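As a rough illustration of the idea described in the abstract, the sketch below computes the classical two-sample Hotelling T² statistic with a pooled covariance matrix and then greedily adds, one gene at a time, the gene that most increases T². It is a minimal sketch under stated assumptions, not the authors' MFS implementation: the function names `hotelling_t2` and `forward_select`, the single greedy pass, the fixed `max_genes` stopping rule, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np


def hotelling_t2(x1, x2):
    """Two-sample Hotelling T^2 for two samples-by-genes arrays.

    Uses the pooled covariance matrix, so the number of genes must be
    smaller than n1 + n2 - 2 for the matrix to be invertible.
    """
    n1, n2 = x1.shape[0], x2.shape[0]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # np.cov returns a 0-d array for a single gene, so force 2-d.
    s1 = np.atleast_2d(np.cov(x1, rowvar=False))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False))
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    # Solve pooled @ z = diff instead of inverting the matrix explicitly.
    z = np.linalg.solve(pooled, diff)
    return float(n1 * n2 / (n1 + n2) * (diff @ z))


def forward_select(x1, x2, max_genes):
    """Greedy forward search (hypothetical, single pass).

    At each step, add the gene whose inclusion gives the largest T^2
    for the current subset, until max_genes genes are selected.
    """
    selected = []
    remaining = list(range(x1.shape[1]))
    for _ in range(max_genes):
        scores = [
            hotelling_t2(x1[:, selected + [g]], x2[:, selected + [g]])
            for g in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic example (assumed data): 20 samples per group, 10 genes,
# with gene 3 shifted strongly in the second group.
rng = np.random.default_rng(0)
group1 = rng.normal(size=(20, 10))
group2 = rng.normal(size=(20, 10))
group2[:, 3] += 5.0
selected = forward_select(group1, group2, max_genes=3)
```

Because T² is computed on the candidate subset as a whole, a gene can be chosen for what it contributes jointly with the genes already selected, which is the property the abstract contrasts with univariate tests. A real analysis would also need a significance-based stopping rule and some handling of the case where genes outnumber samples, both of which are omitted here.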