Berrar Daniel P, Downes C Stephen, Dubitzky Werner
School of Biomedical Sciences, University of Ulster at Coleraine, BT52 1SA, Northern Ireland.
Pac Symp Biocomput. 2003:5-16.
Gene expression profiling by microarray technology has been successfully applied to the classification and diagnostic prediction of cancers. Various machine learning and data mining methods are currently used to classify gene expression data. However, these methods were not developed to address the specific requirements of gene microarray analysis. First, microarray data are characterized by a high-dimensional feature space whose dimensionality often exceeds the number of samples by a factor of 100 or more. Second, microarray data exhibit a high degree of noise. Most existing methods do not adequately address these problems of dimensionality and noise. Furthermore, although machine learning and data mining methods are based on statistics, most such techniques do not meet the biologist's requirement for sound mathematical confidence measures. Finally, most machine learning and data mining classification methods fail to incorporate misclassification costs, i.e. they are indifferent to the costs associated with false positive and false negative classifications. In this paper, we present a probabilistic neural network (PNN) model that addresses all of these issues. The PNN model provides sound statistical confidences for its decisions, and it is able to model asymmetrical misclassification costs. We demonstrate the performance of the PNN on multiclass gene expression data sets and compare it with two other machine learning methods, a decision tree and a neural network. To assess the performance of the classifiers, we use a lift-based scoring system that allows a fair comparison of different models. The PNN clearly outperformed the other models. The results demonstrate the successful application of the PNN model to multiclass cancer classification.
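As an illustration of the two PNN properties named in the abstract (class confidences and asymmetric misclassification costs), the following is a minimal sketch of a generic Parzen-window PNN. It is not the authors' implementation: the Gaussian kernel, the smoothing parameter `sigma`, the function names `pnn_posteriors` and `min_cost_decision`, the cost matrix, and the toy data are all assumptions made for the example.

```python
import numpy as np


def pnn_posteriors(X_train, y_train, X_test, sigma=1.0):
    """Estimate class posteriors with Gaussian Parzen windows (generic PNN sketch)."""
    classes = np.unique(y_train)
    # Pattern layer: squared Euclidean distance from each test sample
    # to each training sample, passed through a Gaussian kernel.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))
    # Summation layer: mean kernel activation per class, which is
    # proportional to the class-conditional density estimate.
    dens = np.stack(
        [kernel[:, y_train == c].mean(axis=1) for c in classes], axis=1
    )
    # Class priors estimated from training frequencies (an assumption).
    priors = np.array([(y_train == c).mean() for c in classes])
    # A tiny floor avoids division by zero if every kernel underflows.
    joint = dens * priors + np.finfo(float).tiny
    post = joint / joint.sum(axis=1, keepdims=True)
    return classes, post


def min_cost_decision(classes, post, cost):
    """Pick the class with the lowest expected misclassification cost.

    cost[i, j] is the assumed cost of predicting class j when the
    true class is i; the diagonal is normally zero.
    """
    expected_cost = post @ cost
    return classes[expected_cost.argmin(axis=1)]


# Hypothetical toy data: two classes, two features.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2.5, 2.75]])  # slightly closer to class 0

classes, post = pnn_posteriors(X_train, y_train, X_test, sigma=1.0)

symmetric = np.array([[0.0, 1.0], [1.0, 0.0]])
asymmetric = np.array([[0.0, 1.0], [5.0, 0.0]])  # missing class 1 costs 5x

pred_symmetric = min_cost_decision(classes, post, symmetric)
pred_asymmetric = min_cost_decision(classes, post, asymmetric)
print(post, pred_symmetric, pred_asymmetric)
```

With symmetric costs the decision reduces to picking the class with the highest posterior. Raising the cost of one error type shifts the decision boundary toward the cheaper error, which is the behaviour the abstract refers to as modelling asymmetrical misclassification costs.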