Efficient Semisupervised MEDLINE Document Clustering With MeSH-Semantic and Global-Content Constraints.

Publication information

IEEE Trans Cybern. 2013 Aug;43(4):1265-76. doi: 10.1109/TSMCB.2012.2227998.

Abstract

For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collection, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents do not necessarily integrate these types of information effectively, since they use only one or two of them. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining the LC and MS information. However, a simple linear combination can be ineffective because the representation space is limited when combining different types of information (similarities) with different reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, which clusters over the LC similarities under two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on those with low similarities. We empirically evaluate SSNCut on MEDLINE document clustering using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistically significant differences. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with constraints from only one type of similarity. Another interesting finding was that ML constraints worked more effectively than CL constraints, since around 10% of the CL constraints were incorrect, whereas only 1% of the ML constraints were.
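
The constraint step described in the abstract can be illustrated with a minimal sketch. It only shows how must-link and cannot-link pairs might be derived from a pairwise similarity matrix by thresholding; it is not the SSNCut algorithm itself. The function name `derive_constraints`, the thresholds `high` and `low`, and the toy similarity values are all assumptions made for illustration, since the abstract does not state how the cutoffs are chosen.

```python
from itertools import combinations


def derive_constraints(sim, high, low):
    """Split document pairs into must-link and cannot-link lists.

    sim  : symmetric matrix (list of lists) of pairwise similarities
    high : pairs with similarity >= high become must-link (ML)
    low  : pairs with similarity <= low become cannot-link (CL)
    Both thresholds are hypothetical; the abstract gives no values.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(sim)), 2):
        if sim[i][j] >= high:
            must_link.append((i, j))
        elif sim[i][j] <= low:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Toy MS-style similarity matrix for four documents (made-up values).
ms_sim = [
    [1.00, 0.92, 0.10, 0.45],
    [0.92, 1.00, 0.05, 0.50],
    [0.10, 0.05, 1.00, 0.88],
    [0.45, 0.50, 0.88, 1.00],
]

ml, cl = derive_constraints(ms_sim, high=0.85, low=0.15)
print("must-link:", ml)
print("cannot-link:", cl)
```

Pairs whose similarity falls between the two thresholds produce no constraint, so in a setup like this only the LC similarities would decide how those documents are clustered.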

Similar articles

Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity.

Bioinformatics. 2009 Aug 1;25(15):1944-51. doi: 10.1093/bioinformatics/btp338. Epub 2009 Jun 3.

MeSHSim: An R/Bioconductor package for measuring semantic similarity over MeSH headings and MEDLINE documents.

J Bioinform Comput Biol. 2015 Dec;13(6):1542002. doi: 10.1142/S0219720015420020. Epub 2015 Sep 9.

A knowledge-driven approach to biomedical document conceptualization.

Artif Intell Med. 2010 Jun;49(2):67-78. doi: 10.1016/j.artmed.2010.02.005. Epub 2010 Apr 3.

Context-driven automatic subgraph creation for literature-based discovery.

J Biomed Inform. 2015 Apr;54:141-57. doi: 10.1016/j.jbi.2015.01.014. Epub 2015 Feb 7.

An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering.

Comput Math Methods Med. 2021 Nov 9;2021:7937573. doi: 10.1155/2021/7937573. eCollection 2021.

Exploring supervised and unsupervised methods to detect topics in biomedical text.

BMC Bioinformatics. 2006 Mar 16;7:140. doi: 10.1186/1471-2105-7-140.

Acquiring Plausible Predications from MEDLINE by Clustering MeSH Annotations.

Stud Health Technol Inform. 2015;216:716-20.

Knowledge Extraction from MEDLINE by Combining Clustering with Natural Language Processing.

AMIA Annu Symp Proc. 2015 Nov 5;2015:915-24. eCollection 2015.

DeepMeSH: deep semantic representation for improving large-scale MeSH indexing.

Bioinformatics. 2016 Jun 15;32(12):i70-i79. doi: 10.1093/bioinformatics/btw294.

Cited by

A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments.

Nat Commun. 2022 Apr 28;13(1):2326. doi: 10.1038/s41467-022-29843-y.

An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering.

Comput Math Methods Med. 2021 Nov 9;2021:7937573. doi: 10.1155/2021/7937573. eCollection 2021.

FullMeSH: improving large-scale MeSH indexing with full text.

Bioinformatics. 2020 Mar 1;36(5):1533-1541. doi: 10.1093/bioinformatics/btz756.

SolidBin: improving metagenome binning with semi-supervised normalized cut.

Bioinformatics. 2019 Nov 1;35(21):4229-4238. doi: 10.1093/bioinformatics/btz253.

Biomedical semantic indexing by deep neural network with multi-task learning.

BMC Bioinformatics. 2018 Dec 21;19(Suppl 20):502. doi: 10.1186/s12859-018-2534-2.

DeepMeSH: deep semantic representation for improving large-scale MeSH indexing.

Bioinformatics. 2016 Jun 15;32(12):i70-i79. doi: 10.1093/bioinformatics/btw294.

MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence.

Bioinformatics. 2015 Jun 15;31(12):i339-47. doi: 10.1093/bioinformatics/btv237.

Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial.

Methods. 2015 Mar;74:65-70. doi: 10.1016/j.ymeth.2014.11.006. Epub 2014 Nov 20.
