Winters-Hilt Stephen, Merat Sam
Department of Computer Science, University of New Orleans, LA 70148, USA.
BMC Bioinformatics. 2007 Nov 1;8 Suppl 7(Suppl 7):S18. doi: 10.1186/1471-2105-8-S7-S18.
Support Vector Machines (SVMs) provide a powerful method for classification (supervised learning). Use of SVMs for clustering (unsupervised learning) is now being considered in a number of different ways.
An SVM-based clustering algorithm is introduced that clusters data with no a priori knowledge of the input classes. The algorithm initializes by running a binary SVM classifier against a data set in which each vector is randomly labelled; this is repeated until an initial convergence occurs. Once this initialization step is complete, the SVM confidence parameters for classification of each training instance can be accessed. The lowest-confidence data (i.e., the worst of the mislabelled data) then have their labels switched to the other class. The SVM is re-run on the partly relabelled data set and is guaranteed to converge in this situation, since it converged previously and now carries fewer data points with mislabelling penalties. This approach appears to limit exposure to the local-minima traps that can occur with other approaches. The algorithm thus improves on its weakly convergent result by SVM retraining after each relabelling of the worst of the misclassified vectors, i.e., those feature vectors with confidence-factor values beyond some threshold. Repetition of the above process improves the accuracy, here a measure of separability, until there are no misclassifications. Variations on this type of clustering approach are shown.
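The relabel-and-retrain loop described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: a simple linear classifier trained by subgradient descent on the regularized hinge loss stands in for a full SVM solver, the signed margin y·f(x) is used as the per-point confidence value, only the single worst-confidence point is flipped per round, and the toy two-cluster data, function names, and all parameter values are assumptions.

```python
import random

random.seed(0)

# Toy data: two well-separated 2-D clusters. The true cluster identities
# are never shown to the algorithm.
def make_data(n_per_cluster=20):
    pts = []
    for cx, cy in [(-2.0, -2.0), (2.0, 2.0)]:
        for _ in range(n_per_cluster):
            pts.append((cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)))
    return pts

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=200):
    """Full-batch subgradient descent on the regularized hinge loss
    (a minimal stand-in for a proper SVM training routine)."""
    w, b, n = [0.0, 0.0], 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [lam * w[0], lam * w[1]], 0.0
        for (x1, x2), yi in zip(X, y):
            if yi * (w[0] * x1 + w[1] * x2 + b) < 1:  # inside the margin
                gw[0] -= yi * x1 / n
                gw[1] -= yi * x2 / n
                gb -= yi / n
        w[0] -= lr * gw[0]
        w[1] -= lr * gw[1]
        b -= lr * gb
    return w, b

def svm_cluster(X, max_rounds=500):
    # Step 1: random labels -- no a priori class knowledge.
    y = [random.choice([-1, 1]) for _ in X]
    for round_ in range(max_rounds):
        w, b = train_linear_svm(X, y)
        # Signed margin y_i * f(x_i): negative means misclassified,
        # and the most negative value is the lowest-confidence point.
        margins = [yi * (w[0] * x1 + w[1] * x2 + b)
                   for (x1, x2), yi in zip(X, y)]
        worst = min(range(len(X)), key=lambda i: margins[i])
        if margins[worst] >= 0:      # no misclassifications: stop
            return y, round_
        y[worst] = -y[worst]         # flip the worst label, then retrain
    return y, max_rounds

X = make_data()
labels, rounds = svm_cluster(X)
```

Flipping a misclassified point strictly lowers that point's hinge loss, and the subsequent retraining lowers the loss further, which is the intuition behind the convergence argument in the abstract; the final labelling is self-consistent with the trained classifier rather than guaranteed to match the true clusters.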
Non-parametric SVM-based clustering methods may allow for much improved performance over parametric approaches, particularly if they can be designed to inherit the strengths of their supervised SVM counterparts.