Kamath Vidya, Yeatman Timothy J, Eschrich Steven A
Biomedical Engineering program at the University of South Florida, Tampa, Florida, USA.
Annu Int Conf IEEE Eng Med Biol Soc. 2008;2008:5704-7. doi: 10.1109/IEMBS.2008.4650509.
Gene expression signatures identify important genes that predict a specified outcome. In several notable diseases, such as leukemia and breast cancer, the results have been encouraging, and many techniques discriminate particular outcomes well in those datasets. However, the same methods applied to other datasets fail to achieve similar levels of success. Given the small sample sizes common to these studies and the high dimensionality of the data, several key issues arise when attempting to construct reliable, reproducible gene signatures: the classifiers may not be sufficient to discriminate the classes, or the data themselves may not be sufficient to produce effective separation. In this paper, three simple measures of classification complexity are considered to explore a limit on the predictive accuracy that can be achieved in a dataset. Two independent gene expression datasets (lung and colorectal cancer) are considered, with three different outcomes on each dataset. Four different classifiers, each using the t-test for feature selection, were tested on these datasets as a representative panel. Our results indicate that Fisher's discriminant ratio provides a good measure of the complexity of the classification problem, with a high correlation between complexity and best classification accuracy across all problems (R² = 0.78). Specifically, predicting gender is a low-complexity problem, as indicated by both the complexity measure and the classification results. More clinically oriented endpoints are more complex and have lower classification accuracies. Therefore, classification complexity can be used to estimate the maximum attainable accuracy for a problem, reducing the need to evaluate many different classifiers.
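As an illustration of the complexity measure named in the abstract, the sketch below computes Fisher's discriminant ratio for a two-class problem. The abstract does not state the formula, so this assumes the commonly used definition: for each feature, the squared difference of the class means divided by the sum of the class variances, with the maximum taken over features. Under this definition a larger ratio means better-separated classes, that is, a less complex problem. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def fisher_ratio(class_a, class_b):
    """Fisher's discriminant ratio for one feature (assumed definition):
    (mean_a - mean_b)^2 / (var_a + var_b), using sample variances."""
    spread = variance(class_a) + variance(class_b)
    gap = (mean(class_a) - mean(class_b)) ** 2
    if spread == 0:
        # Both classes are constant: perfectly separable or identical.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def max_fisher_ratio(samples, labels):
    """Largest per-feature ratio over all features.

    samples: list of rows, one list of feature values per sample.
    labels:  list of 0/1 class labels, one per sample.
    """
    best = 0.0
    for j in range(len(samples[0])):
        a = [row[j] for row, y in zip(samples, labels) if y == 0]
        b = [row[j] for row, y in zip(samples, labels) if y == 1]
        best = max(best, fisher_ratio(a, b))
    return best


# Toy data, invented for illustration: feature 0 separates the classes,
# feature 1 does not.
toy_samples = [
    [1.0, 5.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [7.0, 4.0],
    [8.0, 5.0],
    [9.0, 3.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_ratio = max_fisher_ratio(toy_samples, toy_labels)
```

On real expression data the ratio would be computed over thousands of genes with few samples per class, so the per-feature variance estimates are noisy; the sketch makes no attempt to correct for that.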