Leyla Jael Garcia Castro, Rafael Berlanga, Alexander Garcia
Temporal Knowledge Bases Group, Department of Computer Languages and Systems, Universitat Jaume I, 12071 Castelló de la Plana, Spain.
J Biomed Inform. 2015 Oct;57:204-18. doi: 10.1016/j.jbi.2015.07.015. Epub 2015 Aug 1.
Although publishers provide full-text articles in electronic formats, finding related work beyond the title and abstract remains a challenge. Identifying related articles based on their abstracts is a good starting point; the process is straightforward and does not consume as many resources as full-text similarity would require. However, further analyses may require an in-depth understanding of the full content: two articles with highly related abstracts can differ substantially in their full text. How similarity differs when computed over title-and-abstract versus full text, and which semantic similarity metric performs best on full-text articles, are the main questions addressed in this manuscript.
We benchmarked three similarity metrics (BM25, PMRA, and Cosine) to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated how similarity values based on title-and-abstract differ from those relying on full text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. We first used entity recognition software to semantically annotate titles and abstracts as well as full text with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts together with their term frequency and inverse document frequency; we then applied the similarity metrics to those document profiles. We used correlation, precision, recall, and F1 to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses to understand how similarity varies when the full text is considered.
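The document-profile construction described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: documents are represented as bags of UMLS concept identifiers (the concept IDs and helper names below are illustrative assumptions), each profile is weighted by TF-IDF, and the Cosine metric is computed over the resulting sparse profiles.

```python
import math
from collections import Counter

def tfidf_profiles(docs):
    """Build TF-IDF weighted concept profiles.

    docs: dict mapping a document id to the list of concept ids
    produced by entity recognition (one entry per annotation, so
    repeated concepts encode term frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each concept appears.
    df = Counter()
    for concepts in docs.values():
        df.update(set(concepts))
    profiles = {}
    for doc_id, concepts in docs.items():
        tf = Counter(concepts)
        # Weight each concept by tf * idf; concepts present in every
        # document get idf = 0 and thus contribute nothing.
        profiles[doc_id] = {c: freq * math.log(n / df[c])
                            for c, freq in tf.items()}
    return profiles

def cosine(p, q):
    """Cosine similarity between two sparse concept profiles."""
    common = set(p) & set(q)
    dot = sum(p[c] * q[c] for c in common)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Toy corpus with made-up concept ids (not real UMLS CUIs).
docs = {
    "a": ["C1", "C1", "C2"],
    "b": ["C1", "C2", "C3"],
    "c": ["C3", "C4"],
}
profiles = tfidf_profiles(docs)
sim_ab = cosine(profiles["a"], profiles["b"])
```

BM25 and PMRA follow the same profile-based pattern but replace the weighting and scoring functions; comparing articles then reduces to computing pairwise scores over these profiles, which is what makes the approach cheap relative to raw full-text comparison.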
We found that the PubMed Related Articles (PMRA) similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics outperformed the word-stem-based one. Our experiments show that similarity values differ when computed over title-and-abstract alone versus full text. Analyses based on full text therefore become useful when research requires going beyond the title and abstract, particularly regarding connectivity across articles.
Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323.