Jiang Peng, Missoum Samy, Chen Zhao
Aerospace and Mechanical Engineering Department, University of Arizona, Tucson, Arizona.
Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, Arizona.
Struct Multidiscipl Optim. 2014 Oct 1;50(4):523-535. doi: 10.1007/s00158-014-1105-z.
This article presents a study of three validation metrics used to select the optimal parameters of a support vector machine (SVM) classifier for non-separable and unbalanced datasets. Such datasets are often encountered when the data are obtained experimentally or clinically. The three metrics considered are the area under the ROC curve (AUC), accuracy, and balanced accuracy. The metrics are tested using computational data only, which makes it possible to create fully separable datasets. Non-separable datasets, representative of a real-world problem, can then be created by projecting the separable data onto a lower-dimensional subspace. Knowledge of the separable dataset, which is unavailable in real-world problems, provides a reference for comparing the three validation metrics through a quantity referred to as the "weighted likelihood". As an application example, the study investigates a classification model for hip fracture prediction, with data obtained from a parameterized finite element model of a femur. The performance of the validation metrics is studied for several levels of separability, ratios of unbalance, and training set sizes.
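To make the three metrics concrete, the following minimal Python sketch computes accuracy, balanced accuracy, and AUC on a small unbalanced toy sample. It uses the standard textbook definitions, not necessarily the authors' implementation; the function names, the +1/-1 label convention, and the toy decision values are assumptions introduced here for illustration only.

```python
# Illustrative sketch: textbook definitions of the three validation metrics.
# Labels follow the usual SVM convention (+1 = positive, -1 = negative).
# All names and data below are hypothetical, not taken from the paper.

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == 1)
    n_neg = len(pairs) - n_pos
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t != 1 and p != 1)
    return 0.5 * (tp / n_pos + tn / n_neg)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking: the probability that
    a random positive scores higher than a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Unbalanced toy sample: 2 positives, 8 negatives.
y_true = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
# Hypothetical SVM decision values; the predicted label is their sign.
scores = [0.9, -0.2, 0.3, -0.5, -0.6, -0.7, -0.8, -0.9, -1.0, -1.1]
y_pred = [1 if s > 0 else -1 for s in scores]

# A trivial classifier that always predicts the majority class.
y_majority = [-1] * len(y_true)

print(accuracy(y_true, y_pred))               # 0.8
print(balanced_accuracy(y_true, y_pred))      # 0.6875
print(auc(y_true, scores))                    # 0.9375
print(accuracy(y_true, y_majority))           # 0.8
print(balanced_accuracy(y_true, y_majority))  # 0.5
```

On this sample the majority-class predictor reaches the same accuracy (0.8) as the score-based classifier, while its balanced accuracy drops to 0.5. This is the kind of disagreement between metrics that can arise on unbalanced data.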