Department of Computing Science, Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, Singapore 138632, Singapore; School of Electrical and Electronics Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore.
The Institute for Medicine and Public Health, Vanderbilt University, 2525 West End Avenue, Suite 600, Nashville, TN 37203-1738, USA.
Artif Intell Med. 2014 Mar;60(3):189-96. doi: 10.1016/j.artmed.2014.01.003. Epub 2014 Feb 7.
Support vector machines (SVMs) have drawn considerable attention due to their high generalisation ability and superior classification performance compared to other pattern recognition algorithms. However, the assumption that the training data are independently and identically generated from a single unknown probability distribution may limit the application of SVMs to real-world problems. In this paper, we propose a vicinal support vector classifier (VSVC) that can effectively handle practical applications in which the learning data may originate from different probability distributions.
The proposed VSVC method utilises a set of new vicinal kernel functions constructed via supervised clustering in the kernel-induced feature space. Our approach comprises two steps. In the clustering step, a supervised kernel-based deterministic annealing (SKDA) clustering algorithm partitions the training data into soft vicinal areas of the feature space, from which the vicinal kernel functions are constructed. In the training step, the SVM technique minimises the vicinal risk function under the constraints of the vicinal areas defined in the SKDA clustering step.
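The two-step structure above (supervised clustering, then SVM training over the cluster representatives) can be sketched in a few lines. This is a simplified illustration only, not the paper's method: ordinary per-class k-means stands in for the SKDA algorithm, a standard RBF SVM trained on the cluster centres stands in for vicinal risk minimisation, and the two-Gaussian toy data is a hypothetical stand-in for the artificial dataset. It does, however, reproduce the key structural property reported below: the support vectors are drawn from the cluster representatives, so the solution is at most as dense as the cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical two-class toy data (overlapping Gaussians).
X0 = rng.normal(loc=-1.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=+1.0, scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

def cluster_per_class(X, y, k_per_class=4, seed=0):
    """Supervised clustering: cluster each class separately.
    Plain k-means is used here as a stand-in for SKDA."""
    centres, labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=k_per_class, n_init=10, random_state=seed)
        km.fit(X[y == c])
        centres.append(km.cluster_centers_)
        labels.extend([c] * k_per_class)
    return np.vstack(centres), np.array(labels)

# Step 1: partition each class into vicinal areas (cluster centres).
C, yc = cluster_per_class(X, y, k_per_class=4)

# Step 2: train a standard RBF SVM on the 8 cluster representatives only,
# so the number of support vectors is bounded by the cluster count.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(C, yc)

acc = clf.score(X, y)  # accuracy of the sparse model on the full data
print(len(clf.support_), acc)
```

Because the SVM is fit on the eight centres rather than the 400 training points, the resulting decision function is necessarily sparse, mirroring the support-vector count behaviour described in the results.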
Experimental results on both artificial and real medical datasets show that the proposed VSVC achieves better classification accuracy and lower computational time than a standard SVM. For an artificial dataset constructed from non-separable data, the classification accuracy of VSVC lies between 95.5% and 96.25% (for different cluster numbers), which compares favourably to the 94.5% achieved by SVM. The VSVC training time is between 8.75s and 17.83s (for 2-8 clusters), considerably less than the 65.0s required by SVM. On a real mammography dataset, the best classification accuracy of VSVC is 85.7%, clearly outperforming a standard SVM, which obtains an accuracy of only 82.1%. A similar improvement is confirmed on two further real datasets, a breast cancer dataset (74.01% vs. 72.52%) and a heart dataset (84.77% vs. 83.81%), coupled with a reduction in learning time (32.07s vs. 92.08s and 25.00s vs. 53.31s, respectively). Furthermore, VSVC yields a number of support vectors equal to the specified cluster number, and hence a much sparser solution than a standard SVM.
Incorporating a supervised clustering algorithm into the SVM technique leads to a sparse but effective solution, while making the proposed VSVC adaptive to different probability distributions of the training data.