IEEE Trans Cybern. 2013 Aug;43(4):1265-76. doi: 10.1109/TSMCB.2012.2227998.
For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from the documents themselves, the global-content (GC) information from the entire MEDLINE collection, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents use only one or two of these types and are not necessarily effective at integrating them. Recently, the performance of MEDLINE document clustering has been improved by linearly combining the LC and MS information. However, a simple linear combination can be ineffective because a single representation space is limited in its ability to combine different types of information (similarities) that differ in reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, which clusters over the LC similarities under two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on pairs with low similarities. We empirically evaluate SSNCut on MEDLINE document clustering using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, and that the differences were statistically significant. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with constraints from only one type of similarity. Another interesting finding was that ML constraints were more effective than CL constraints: around 10% of the CL constraints were incorrect, compared with only 1% of the ML constraints.
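The idea of clustering over LC similarities while taking ML and CL constraints from a second similarity source can be illustrated with a small sketch. This is not the SSNCut objective, which the abstract does not spell out. It swaps in a simpler, well-known heuristic: edit the affinity matrix (set must-link pairs to the maximum weight, cannot-link pairs to zero) and then apply a standard two-way normalized cut. The function names, the thresholds `hi` and `lo`, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def derive_constraints(S_aux, hi=0.8, lo=0.2):
    """Threshold an auxiliary similarity matrix (e.g. MS or GC) into
    must-link and cannot-link pair lists. hi and lo are assumed values."""
    iu, ju = np.triu_indices(S_aux.shape[0], k=1)
    vals = S_aux[iu, ju]
    ml = [(int(i), int(j)) for i, j, v in zip(iu, ju, vals) if v >= hi]
    cl = [(int(i), int(j)) for i, j, v in zip(iu, ju, vals) if v <= lo]
    return ml, cl


def apply_constraints(W, ml, cl):
    """Affinity-editing heuristic (not the SSNCut objective):
    must-link pairs get weight 1.0, cannot-link pairs get weight 0.0."""
    W = W.astype(float).copy()
    for i, j in ml:
        W[i, j] = W[j, i] = 1.0
    for i, j in cl:
        W[i, j] = W[j, i] = 0.0
    return W


def ncut_bipartition(W):
    """Two-way normalized cut: split on the sign of the second-smallest
    generalized eigenvector of (D - W) x = lambda * D x."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    fiedler = vecs[:, 1] * d_inv_sqrt    # map back to the generalized problem
    return (fiedler > 0).astype(int)


def toy_matrices():
    """Six hypothetical documents: {0, 1, 2} and {3, 4, 5} are the true groups,
    but the LC similarities pull document 2 toward the wrong group."""
    n = 6
    W_lc = np.full((n, n), 0.1)
    W_lc[:2, :2] = 0.9
    W_lc[3:, 3:] = 0.9
    W_lc[2, :2] = W_lc[:2, 2] = 0.3
    W_lc[2, 3:] = W_lc[3:, 2] = 0.6
    np.fill_diagonal(W_lc, 1.0)
    # Auxiliary similarities are informative only for pairs involving document 2.
    S_aux = np.full((n, n), 0.5)
    for i, j, v in [(0, 2, 0.9), (1, 2, 0.9), (2, 3, 0.1), (2, 4, 0.1), (2, 5, 0.1)]:
        S_aux[i, j] = S_aux[j, i] = v
    return W_lc, S_aux


if __name__ == "__main__":
    W_lc, S_aux = toy_matrices()
    ml, cl = derive_constraints(S_aux)
    print("must-link:", ml)
    print("cannot-link:", cl)
    print("unconstrained labels:", ncut_bipartition(W_lc))
    print("constrained labels:  ", ncut_bipartition(apply_constraints(W_lc, ml, cl)))
```

Because the sign of an eigenvector is arbitrary, the 0/1 labels may be swapped between runs or platforms; only the grouping is meaningful. Hard-editing the affinities also treats every constraint as fully reliable, which is a poor fit when a noticeable share of constraints is wrong, as the abstract reports for CL constraints.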