Identification of the Best Semantic Expansion to Query PubMed Through Automatic Performance Assessment of Four Search Strategies on All Medical Subject Heading Descriptors: Comparative Study.

Authors

Massonnaud Clément R, Kerdelhué Gaétan, Grosjean Julien, Lelong Romain, Griffon Nicolas, Darmoni Stefan J

Affiliations

Department of Biomedical Informatics, Rouen University Hospital, Rouen, France.

Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France.

Publication

JMIR Med Inform. 2020 Jun 4;8(6):e12799. doi: 10.2196/12799.

Abstract

BACKGROUND

With the continuous expansion of available biomedical data, efficient and effective information retrieval has become of utmost importance. Semantic expansion of queries using synonyms may improve information retrieval.

OBJECTIVE

The aim of this study was to automatically construct and evaluate expanded PubMed queries of the form "preferred term"[MH] OR "preferred term"[TIAB] OR "synonym 1"[TIAB] OR "synonym 2"[TIAB] OR …, for each of the 28,313 Medical Subject Heading (MeSH) descriptors, by using different semantic expansion strategies. We sought to propose an innovative method that could automatically evaluate these strategies, based on the three main metrics used in information science (precision, recall, and F-measure).
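
The query form described above can be sketched as a small helper (a minimal illustration; the function name and terms are hypothetical, not from the paper):

```python
def build_expanded_query(preferred_term, synonyms):
    """Build an expanded PubMed query of the form:
    "preferred term"[MH] OR "preferred term"[TIAB]
    OR "synonym 1"[TIAB] OR "synonym 2"[TIAB] OR ...
    """
    parts = [f'"{preferred_term}"[MH]', f'"{preferred_term}"[TIAB]']
    parts += [f'"{syn}"[TIAB]' for syn in synonyms]
    return " OR ".join(parts)

# Illustrative example (terms chosen for demonstration only):
print(build_expanded_query("Myocardial Infarction",
                           ["heart attack", "cardiac infarction"]))
# → "Myocardial Infarction"[MH] OR "Myocardial Infarction"[TIAB]
#   OR "heart attack"[TIAB] OR "cardiac infarction"[TIAB]
```

The three expansion strategies evaluated in the study differ only in which synonym list is passed to such a builder.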

METHODS

Three semantic expansion strategies were assessed. They differed by the synonyms used to build the queries as follows: MeSH synonyms, Unified Medical Language System (UMLS) mappings, and custom mappings (Catalogue et Index des Sites Médicaux de langue Française [CISMeF]). The precision, recall, and F-measure metrics were automatically computed for the three strategies and for the standard automatic term mapping (ATM) of PubMed. The method to automatically compute the metrics involved computing the number of all relevant citations (A), using National Library of Medicine indexing as the gold standard ("preferred term"[MH]), the number of citations retrieved by the added terms ("synonym 1"[TIAB] OR "synonym 2"[TIAB] OR …) (B), and the number of relevant citations retrieved by the added terms (combining the previous two queries with an "AND" operator) (C). It was possible to programmatically compute the metrics for each strategy using each of the 28,313 MeSH descriptors as a "preferred term," corresponding to 239,724 different queries built and sent to the PubMed application program interface. The four search strategies were ranked and compared for each metric.
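
Given the three counts A, B, and C described above, the metrics follow directly. The sketch below assumes the counts have already been obtained (e.g., via three PubMed API calls); it only shows the arithmetic, with zero-denominator guards:

```python
def metrics(a, b, c):
    """Compute precision, recall, and F-measure from the three counts:
    a: all relevant citations, i.e. results of "preferred term"[MH] (gold standard)
    b: citations retrieved by the added terms ("synonym 1"[TIAB] OR ...)
    c: relevant citations retrieved by the added terms (the two queries ANDed)
    """
    precision = c / b if b else 0.0  # fraction of retrieved citations that are relevant
    recall = c / a if a else 0.0     # fraction of relevant citations that are retrieved
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

For example, with A = 100, B = 50, and C = 25, precision is 0.5, recall is 0.25, and the F-measure is 1/3.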

RESULTS

ATM had the worst performance for all three metrics among the four strategies. The MeSH strategy had the best mean precision (51%, SD 23%). The UMLS strategy had the best recall and F-measure (41%, SD 31% and 36%, SD 24%, respectively). CISMeF had the second best recall and F-measure (40%, SD 31% and 35%, SD 24%, respectively). However, considering a cutoff of 5%, CISMeF had better precision than UMLS for 1180 descriptors, better recall for 793 descriptors, and better F-measure for 678 descriptors.

CONCLUSIONS

This study highlights the importance of using semantic expansion strategies to improve information retrieval. However, the performances of a given strategy, relatively to another, varied greatly depending on the MeSH descriptor. These results confirm there is no ideal search strategy for all descriptors. Different semantic expansions should be used depending on the descriptor and the user's objectives. Thus, we developed an interface that allows users to input a descriptor and then proposes the best semantic expansion to maximize the three main metrics (precision, recall, and F-measure).

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/271f/7303830/76ffa987b3a5/medinform_v8i6e12799_fig1.jpg
