Leyla Jael Garcia Castro, Rafael Berlanga, Alexander Garcia
Temporal Knowledge Bases Group, Department of Computer Languages and Systems, Universitat Jaume I, 12071 Castelló de la Plana, Spain.
J Biomed Inform. 2015 Oct;57:204-18. doi: 10.1016/j.jbi.2015.07.015. Epub 2015 Aug 1.
Although publishers provide full-text articles in electronic formats, finding related work beyond the title and abstract remains a challenge. Identifying related articles based on their abstracts is a good starting point; the process is straightforward and does not consume as many resources as full-text similarity would require. However, further analyses may require an in-depth understanding of the full content: two articles with highly related abstracts can differ substantially in their full text. How similarity differs when computed over title-and-abstract versus full text, and which semantic similarity metric performs best on full-text articles, are the main questions addressed in this manuscript.
We benchmarked three similarity metrics (BM25, PMRA, and Cosine) to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated how similarity values based on title-and-abstract differ from those relying on full text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. We first used entity recognition software to semantically annotate titles and abstracts as well as full text with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts together with their term frequency and inverse document frequency; we then applied the similarity metrics to those document profiles. We used correlation, precision, recall, and F1 to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses to understand how similarity varies when the full text is considered.
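The document-profile construction described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: documents are represented as bags of UMLS concept identifiers (the concept IDs and helper names below are illustrative assumptions), each profile is weighted by TF-IDF, and the Cosine metric is computed over the resulting sparse profiles.

```python
import math
from collections import Counter

def tfidf_profiles(docs):
    """Build TF-IDF weighted concept profiles.

    docs: dict mapping a document id to the list of concept ids
    produced by entity recognition (one entry per annotation, so
    repeated concepts encode term frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each concept appears.
    df = Counter()
    for concepts in docs.values():
        df.update(set(concepts))
    profiles = {}
    for doc_id, concepts in docs.items():
        tf = Counter(concepts)
        # Weight each concept by tf * idf; concepts present in every
        # document get idf = 0 and thus contribute nothing.
        profiles[doc_id] = {c: freq * math.log(n / df[c])
                            for c, freq in tf.items()}
    return profiles

def cosine(p, q):
    """Cosine similarity between two sparse concept profiles."""
    common = set(p) & set(q)
    dot = sum(p[c] * q[c] for c in common)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Toy corpus with made-up concept ids (not real UMLS CUIs).
docs = {
    "a": ["C1", "C1", "C2"],
    "b": ["C1", "C2", "C3"],
    "c": ["C3", "C4"],
}
profiles = tfidf_profiles(docs)
sim_ab = cosine(profiles["a"], profiles["b"])
```

BM25 and PMRA follow the same profile-based pattern but replace the weighting and scoring functions; comparing articles then reduces to computing pairwise scores over these profiles, which is what makes the approach cheap relative to raw full-text comparison.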
We found that the PubMed Related Articles (PMRA) similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics outperformed the word-stem-based one. Our experiments show that similarity values differ when computed over title-and-abstract alone versus full text. Analyses based on full text therefore become useful when research requires going beyond the title and abstract, particularly regarding connectivity across articles.
Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323.