Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University, 43007 Tarragona, Spain.
Talanta. 2009 Nov 15;80(1):321-8. doi: 10.1016/j.talanta.2009.06.072. Epub 2009 Jul 7.
Microarrays are used to simultaneously measure the expression levels of thousands of genes. An important application of microarrays is the classification of samples into classes of interest (e.g., healthy cells or tumour cells). Discriminant partial least squares (DPLS) has often been used for this purpose. In this paper, we describe an improvement to DPLS that uses kernel-based probability density functions and the Bayes rule to classify samples, whilst keeping the option of not classifying a sample if this cannot be done with sufficient confidence. With this approach, samples that lie outside the boundaries of the known classes or in the ambiguity region between classes are rejected, and only samples with a high probability of being correctly classified are actually classified. The optimal model is found by simultaneously minimizing the misclassification and rejection costs. The method (p-DPLS with reject option) was tested with two datasets. For the human cancers dataset, the accuracy (obtained by leave-one-out cross-validation) improved from 97% to 99% compared with p-DPLS without reject option. For the breast cancer dataset, p-DPLS with reject option rejected 100% of the test samples that did not belong to any of the modelled classes. These samples would have been misclassified if the reject option had not been considered.
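The classification rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each sample has already been reduced to a single score (for example a DPLS prediction), that a Gaussian kernel with a fixed bandwidth is used, and that class priors equal the class frequencies. The names `kernel_density`, `classify_with_reject`, `min_posterior` and `min_density` are hypothetical, and the thresholds are fixed by hand here rather than found by minimizing misclassification and rejection costs.

```python
import math

def kernel_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at x from 1-D training scores."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

def classify_with_reject(x, class_scores, bandwidth=0.1,
                         min_posterior=0.9, min_density=1e-3):
    """Return a class label, or None when the sample is rejected.

    class_scores maps label -> list of training scores for that class.
    Priors are taken as class frequencies (an assumption of this sketch).
    """
    total = sum(len(s) for s in class_scores.values())
    # Prior-weighted class-conditional densities: p(x | c) * P(c)
    joint = {c: kernel_density(x, s, bandwidth) * len(s) / total
             for c, s in class_scores.items()}
    evidence = sum(joint.values())
    # Outlier rejection: x lies far from every modelled class.
    if evidence < min_density:
        return None
    # Bayes rule: posterior P(c | x) of the most probable class.
    best = max(joint, key=joint.get)
    posterior = joint[best] / evidence
    # Ambiguity rejection: no class is sufficiently probable.
    if posterior < min_posterior:
        return None
    return best

# Illustrative, made-up training scores for two classes.
train = {"healthy": [0.0, 0.1, -0.1, 0.05], "tumour": [1.0, 0.9, 1.1, 0.95]}

print(classify_with_reject(0.02, train))                 # near the healthy scores
print(classify_with_reject(5.0, train))                  # far from both classes
print(classify_with_reject(0.5, train, bandwidth=0.3))   # midway between classes
```

In this sketch, a score far from all training data is rejected by the density check, and a score midway between the two classes is rejected by the posterior check. These correspond to the two rejection cases named in the abstract: samples outside the known classes and samples in the ambiguity region.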