Department of Medical Informatics, China Medical University, Shenyang, Liaoning 110001, China.
BMC Bioinformatics. 2013 Jun 7;14:182. doi: 10.1186/1471-2105-14-182.
Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts).
SemRep was used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications whose arguments were highly connected, as determined by filtering on degree centrality. Themes in the summary were identified with a hierarchical clustering algorithm based on the arguments shared among cliques. The validity of the clusters in the resulting summaries was compared to a Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced from major MeSH headings.
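The steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy predications, the frequency and centrality thresholds, the use of Bron-Kerbosch for clique enumeration, and the Jaccard-based merge rule are all assumptions made for the example, and none of the data is SemRep output.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy predications (subject, predicate, object); NOT SemRep output.
PREDICATIONS = 2 * [
    ("Metformin", "TREATS", "Diabetes"),
    ("Insulin", "TREATS", "Diabetes"),
    ("Metformin", "INTERACTS_WITH", "Insulin"),
    ("Obesity", "PREDISPOSES", "Diabetes"),
    ("Insulin", "TREATS", "Obesity"),
    ("Aspirin", "TREATS", "Headache"),
    ("Ibuprofen", "TREATS", "Headache"),
    ("Aspirin", "INTERACTS_WITH", "Ibuprofen"),
] + [("Warfarin", "TREATS", "Thrombosis")]  # seen once, dropped by frequency


def build_graph(predications, min_freq=2):
    """Keep predications seen at least min_freq times; link their arguments."""
    adj = defaultdict(set)
    for (subj, _pred, obj), n in Counter(predications).items():
        if n >= min_freq and subj != obj:
            adj[subj].add(obj)
            adj[obj].add(subj)
    return dict(adj)


def filter_by_degree(adj, min_centrality):
    """Drop nodes whose degree centrality, degree / (n - 1), is below the cutoff."""
    n = len(adj)
    if n < 2:
        return {}
    keep = {v for v, nbrs in adj.items() if len(nbrs) / (n - 1) >= min_centrality}
    return {v: adj[v] & keep for v in keep}


def maximal_cliques(adj):
    """Enumerate maximal cliques (Bron-Kerbosch, no pivoting; small graphs only)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return [c for c in found if len(c) >= 2]


def jaccard(a, b):
    """Share of arguments two argument sets have in common."""
    return len(a & b) / len(a | b)


def cluster_cliques(cliques, min_similarity=0.4):
    """Agglomerative clustering: merge the most similar pair until none qualifies."""
    clusters = [set(c) for c in cliques]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: jaccard(clusters[ij[0]], clusters[ij[1]]),
        )
        if jaccard(clusters[i], clusters[j]) < min_similarity:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


graph = filter_by_degree(build_graph(PREDICATIONS), min_centrality=0.3)
for theme in cluster_cliques(maximal_cliques(graph)):
    print(sorted(theme))
```

Each printed list is one candidate theme: the union of arguments from cliques that share enough members to be merged. In a real setting the thresholds would need tuning against the data, and the abstract does not state which values or linkage criterion were used.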
For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10 percentage points higher than the baseline (43% versus 33%). Compared to the reference standard from MeSH headings, recall, precision and F-score were 0.64, 0.65, and 0.65, respectively.
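For readers unfamiliar with silhouette-style validity, the sketch below shows how cohesion and separation combine into a single score under the standard silhouette definition, s = (b - a) / max(a, b), where a is the mean distance from an item to the other members of its own cluster and b is the smallest mean distance to any other cluster. The abstract does not say exactly how its validity percentages were derived, so the Jaccard distance over clique arguments and the example clusters here are assumptions chosen for illustration.

```python
from statistics import mean


def jaccard_distance(a, b):
    """Distance between two argument sets: 1 minus their Jaccard similarity."""
    return 1 - len(a & b) / len(a | b)


def mean_silhouette(clusters, dist=jaccard_distance):
    """Mean silhouette over all items; expects two or more non-empty clusters."""
    scores = []
    for k, cluster in enumerate(clusters):
        for idx, item in enumerate(cluster):
            peers = cluster[:idx] + cluster[idx + 1:]
            if not peers:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # Cohesion: mean distance to the other members of the same cluster.
            a = mean(dist(item, other) for other in peers)
            # Separation: mean distance to the nearest other cluster.
            b = min(
                mean(dist(item, other) for other in rival)
                for j, rival in enumerate(clusters)
                if j != k
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


# Hypothetical clusters of cliques, each clique given as a set of arguments.
example = [
    [{"Metformin", "Insulin", "Diabetes"}, {"Insulin", "Diabetes", "Obesity"}],
    [{"Aspirin", "Ibuprofen", "Headache"}, {"Ibuprofen", "Headache", "Migraine"}],
]
print(mean_silhouette(example))
```

Scores near 1 indicate tight, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest items assigned to the wrong cluster.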