Liu Zhenqiu, Tan Ming
Division of Biostatistics, University of Maryland Greenebaum Cancer Center, Baltimore, Maryland 21201, USA.
Biometrics. 2008 Dec;64(4):1155-61. doi: 10.1111/j.1541-0420.2008.01015.x. Epub 2008 Mar 24.
In medical diagnosis, the diseased and nondiseased classes are usually unbalanced, and one class may be more important than the other depending on the purpose of the diagnosis. Most standard classification methods, however, are designed to maximize overall accuracy and cannot explicitly assign different costs to different classes. In this article, we propose a novel nonparametric method that directly maximizes the weighted specificity and sensitivity of the receiver operating characteristic curve. Combining advances in machine learning, optimization theory, and statistics, the proposed method has excellent generalization properties and explicitly assigns different error costs to different classes. We present experiments that compare the proposed algorithm with support vector machines and regularized logistic regression, using data from a study on HIV-1 protease as well as six publicly available datasets. Our main conclusion is that the proposed algorithm performs significantly better than the other classifiers tested in most cases. A MATLAB software package is available upon request.
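To illustrate why overall accuracy can mislead on unbalanced classes, the following minimal Python sketch contrasts accuracy with a weighted combination of sensitivity and specificity. It is not the authors' method or their MATLAB package: the function names, the toy labels, and the convex-combination form `w * sensitivity + (1 - w) * specificity` with weight `w` are assumptions made here for illustration, since the abstract does not state the exact objective.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def weighted_sens_spec(y_true, y_pred, w=0.5):
    """Return w * sensitivity + (1 - w) * specificity.

    Labels are 0/1 with 1 = diseased. The weight w in [0, 1] is a
    hypothetical parameter (an assumption, not taken from the paper):
    larger w penalizes missed diseased cases more heavily.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return w * sensitivity + (1 - w) * specificity


# Toy data (made up): 2 diseased and 8 nondiseased subjects, and a
# trivial classifier that labels everyone as nondiseased.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))                 # 0.8
print(weighted_sens_spec(y_true, y_pred, 0.5))  # 0.5
```

On this toy data the trivial classifier reaches 80% accuracy while detecting no diseased subjects. The weighted score exposes this, and raising `w` toward 1 lowers it further.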