Braga-Neto Ulisses, Hashimoto Ronaldo, Dougherty Edward R, Nguyen Danh V, Carroll Raymond J
Section of Clinical Cancer Genetics, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA.
Bioinformatics. 2004 Jan 22;20(2):253-8. doi: 10.1093/bioinformatics/btg399.
Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance.
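The two estimators described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the synthetic two-class Gaussian data, the sample size, and the helper names (`knn_predict`, `resubstitution_error`, `leave_one_out_error`) are all hypothetical, and a 3-nearest-neighbor rule is used simply because it needs no fitting step.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Predict binary labels (0/1) by majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_y[nearest].sum(axis=1)
    return (votes * 2 > k).astype(int)

def resubstitution_error(x, y, k=3):
    """Design the classifier on all data, then test it on the same data."""
    return float(np.mean(knn_predict(x, y, x, k) != y))

def leave_one_out_error(x, y, k=3):
    """Hold out each observation in turn, train on the rest, test on the held-out point."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        pred = knn_predict(x[keep], y[keep], x[i:i + 1], k)[0]
        mistakes += int(pred != y[i])
    return mistakes / n

# Hypothetical data: two Gaussian classes in two dimensions (illustration only).
rng = np.random.default_rng(0)
n_per_class = 20
x = np.vstack([rng.normal(0.0, 1.0, (n_per_class, 2)),
               rng.normal(1.0, 1.0, (n_per_class, 2))])
y = np.repeat([0, 1], n_per_class)

print("resubstitution:", resubstitution_error(x, y))
print("leave-one-out: ", leave_one_out_error(x, y))
```

In this sketch each training point counts among its own nearest neighbors under resubstitution, which is one way to see why that estimate tends to be optimistic. Leave-one-out requires one classifier design per observation, which is the source of its higher computational cost.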
A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-state distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation with respect to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
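To make the model-based comparison concrete, the following Python sketch ranks features by estimated error and by true error under a known Gaussian model. It is a simplified illustration resting on several assumptions that are not taken from the paper: each "feature set" is a single feature, class 0 is N(0, 1) and class 1 is N(delta, 1) with equal priors, the classifier is one-dimensional linear discriminant analysis (a cut at the midpoint of the two sample means), and the agreement measure `top_k_overlap` is a hypothetical stand-in rather than one of the three metrics proposed by the authors.

```python
import math
import numpy as np

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fit_threshold(x, y):
    """1-D LDA with equal priors: cut at the midpoint of the two sample means."""
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    sign = 1.0 if m1 >= m0 else -1.0
    return 0.5 * (m0 + m1), sign

def predict(x, t, sign):
    """Predict class 1 on the side of the threshold indicated by sign."""
    return (sign * (x - t) > 0).astype(int)

def true_error(t, sign, delta):
    """Exact error under the assumed model: N(0,1) vs N(delta,1), equal priors."""
    e0 = 1.0 - phi(t)      # P(x > t | class 0)
    e1 = phi(t - delta)    # P(x <= t | class 1)
    if sign < 0:           # classifier orientation is flipped
        e0, e1 = 1.0 - e0, 1.0 - e1
    return 0.5 * (e0 + e1)

def resub_error(x, y):
    t, s = fit_threshold(x, y)
    return float(np.mean(predict(x, t, s) != y))

def loo_error(x, y):
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        t, s = fit_threshold(x[keep], y[keep])
        wrong += int(predict(x[i:i + 1], t, s)[0] != y[i])
    return wrong / n

def top_k_overlap(estimated, truth, k):
    """Hypothetical agreement measure: features shared by the k best under each ranking."""
    best_est = set(np.argsort(estimated, kind="stable")[:k])
    best_true = set(np.argsort(truth, kind="stable")[:k])
    return len(best_est & best_true)

rng = np.random.default_rng(1)
deltas = np.linspace(0.0, 2.0, 12)   # assumed class separation of each feature
n_per_class = 15
y = np.repeat([0, 1], n_per_class)
x = rng.normal(0.0, 1.0, (2 * n_per_class, len(deltas)))
x[y == 1] += deltas

n_feat = len(deltas)
truth = np.array([true_error(*fit_threshold(x[:, j], y), deltas[j]) for j in range(n_feat)])
resub = np.array([resub_error(x[:, j], y) for j in range(n_feat)])
loo = np.array([loo_error(x[:, j], y) for j in range(n_feat)])

k = 4
print("top-k overlap, resubstitution vs true:", top_k_overlap(resub, truth, k))
print("top-k overlap, leave-one-out vs true: ", top_k_overlap(loo, truth, k))
```

Because the data-generating distribution is known, the error of each designed classifier can be computed exactly rather than estimated, which is what allows a ranking by estimated error to be scored against the ranking by true error. A single simulated sample such as this one says nothing general about which estimator ranks better.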