Tsai Chen-An, Lee Te-Chang, Ho I-Ching, Yang Ueng-Cheng, Chen Chun-Houh, Chen James J
Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration, NCTR/FDA/HFT-20, Jefferson, AR 72079, USA.
Math Biosci. 2005 Jan;193(1):79-100. doi: 10.1016/j.mbs.2004.07.002. Epub 2004 Dec 28.
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because a large number of genes are measured, many of the genes in the original data set are irrelevant to the analysis. Selecting discriminatory genes is therefore critical to the accuracy of clustering and prediction. This paper considers a statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals: As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) and a total of 55 samples is used to illustrate a general framework for the approach. Among four selected gene sets, the gene set omega(I), formed by intersecting the genes selected by the F-test with the union of the genes selected by the one-versus-all t-tests, performs best in both clustering and prediction. Hierarchical clustering and two modified partition (k-means) methods all show that omega(I) groups the 55 samples into seven clusters reasonably well, with the As and AsV samples forming one cluster and the Cd and Cu samples forming another. For prediction, the overall accuracy of the nearest neighbors algorithm using omega(I) to assign the 55 samples to one of the nine treatments is 85%.
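The construction of omega(I) and the nearest-neighbor evaluation can be illustrated with a minimal Python sketch using NumPy and SciPy. This is not the authors' implementation. The abstract does not state the significance level, any multiple-testing adjustment, the distance metric, the number of neighbors, or the validation scheme, so the threshold `alpha=0.01`, the unadjusted p-values, Euclidean distance, a single neighbor, and leave-one-out evaluation are all assumptions made here for illustration. The function names are hypothetical.

```python
import numpy as np
from scipy import stats


def select_genes(X, labels, alpha=0.01):
    """Indices of genes in (F-test set) AND (union of one-versus-all t-test sets).

    X is a (n_genes, n_samples) expression matrix and labels holds one
    treatment label per sample. alpha is an assumed, unadjusted threshold.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Overall one-way F-test per gene: do the class means differ at all?
    groups = [X[:, labels == c] for c in classes]
    _, p_f = stats.f_oneway(*groups, axis=1)
    f_set = p_f < alpha

    # One-versus-all t-tests: each class against all remaining samples pooled.
    union = np.zeros(X.shape[0], dtype=bool)
    for c in classes:
        _, p_t = stats.ttest_ind(X[:, labels == c], X[:, labels != c], axis=1)
        union |= p_t < alpha

    return np.flatnonzero(f_set & union)


def loo_nearest_neighbor_accuracy(X, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor rule (Euclidean distance)."""
    labels = np.asarray(labels)
    S = X.T  # one row per sample
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not be its own neighbor
    predicted = labels[np.argmin(d, axis=1)]
    return float(np.mean(predicted == labels))


# Synthetic demonstration data (not the toxicogenomic data set):
# 200 genes, 3 classes of 6 samples, with only the first 10 genes informative.
rng = np.random.default_rng(0)
demo_labels = np.repeat([0, 1, 2], 6)
demo_X = rng.normal(size=(200, demo_labels.size))
demo_X[:10] += 5.0 * demo_labels  # class-specific shift in informative genes

selected = select_genes(demo_X, demo_labels)
accuracy = loo_nearest_neighbor_accuracy(demo_X[selected], demo_labels)
print(len(selected), accuracy)
```

In this sketch the genes are selected once from all samples before the leave-one-out loop, which tends to give optimistic accuracy estimates. A stricter evaluation would repeat the gene selection inside each held-out fold.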