Improving MeSH classification of biomedical articles using citation contexts.

Affiliation

Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Publication information

J Biomed Inform. 2011 Oct;44(5):881-96. doi: 10.1016/j.jbi.2011.05.007. Epub 2011 Jun 12.

Abstract

Medical Subject Headings (MeSH) are used to index the majority of databases generated by the National Library of Medicine. Essentially, MeSH terms are designed to make information, such as scientific articles, more retrievable and accessible to users of systems such as PubMed. This paper proposes a novel method for automating the assignment of MeSH terms to biomedical publications that takes advantage of citation references to these publications. Our findings show that analysing the citation references that point to a document can provide a useful source of terms that are not present in the document itself. The use of these citation contexts, as they are known, can thus help to provide a richer document feature representation, which in turn can help improve text mining and information retrieval applications, in our case MeSH term classification. In this paper, we also explore new methods of selecting and utilising citation contexts. In particular, we assess the effect of weighting the importance of citation terms (found in the citation contexts) according to two aspects: (i) the section of the paper they appear in and (ii) their distance to the citation marker. We conduct intrinsic and extrinsic evaluations of citation term quality. For the intrinsic evaluation, we rely on the UMLS Metathesaurus conceptual database to explore the semantic characteristics of the mined citation terms. We also analyse the "informativeness" of these terms using a class-entropy measure. For the extrinsic evaluation, we run a series of automatic document classification experiments over MeSH terms. Our experimental evaluation shows that citation contexts contain terms that are related to the original document, and that integrating this knowledge results in better classification performance than two state-of-the-art MeSH classification systems: MeSHUP and MTI. Our experiments also demonstrate that taking the Section and Distance factors into account can lead to statistically significant improvements in citation feature quality, thus opening the way for better document feature representation in other biomedical text processing applications.
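
The two weighting aspects and the class-entropy measure mentioned in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the section weight values, the inverse-distance decay, the token-based distance, and the use of Shannon entropy over class counts are all hypothetical choices, and the function names `weighted_citation_terms` and `class_entropy` are invented.

```python
import math
from collections import defaultdict

# Hypothetical section weights; the abstract does not report actual values.
SECTION_WEIGHTS = {
    "introduction": 0.5,
    "methods": 1.0,
    "results": 1.0,
    "discussion": 0.75,
}


def weighted_citation_terms(contexts, decay=0.1):
    """Accumulate one weight per term over a list of citation contexts.

    Each context is a tuple (section, tokens, marker_index), where
    marker_index is the position of the citation marker in tokens.
    Assumed scoring: section_weight / (1 + decay * distance), with
    distance counted in tokens from the citation marker.
    """
    weights = defaultdict(float)
    for section, tokens, marker_index in contexts:
        section_weight = SECTION_WEIGHTS.get(section, 0.5)
        for i, token in enumerate(tokens):
            if i == marker_index:
                continue  # skip the citation marker itself
            distance = abs(i - marker_index)
            weights[token.lower()] += section_weight / (1.0 + decay * distance)
    return dict(weights)


def class_entropy(class_counts):
    """Shannon entropy (in bits) of a term's distribution over classes.

    class_counts maps a class label (for example a MeSH term) to the
    number of documents of that class in which the term occurs.
    Lower entropy means the term is concentrated in fewer classes.
    """
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    probs = [count / total for count in class_counts.values() if count > 0]
    return -sum(p * math.log2(p) for p in probs)


if __name__ == "__main__":
    contexts = [
        ("methods", ["insulin", "resistance", "[1]", "was", "measured"], 2),
        ("introduction", ["diabetes", "studies", "[1]"], 2),
    ]
    for term, weight in sorted(weighted_citation_terms(contexts).items()):
        print(f"{term}: {weight:.3f}")
    print(class_entropy({"Diabetes Mellitus": 8, "Insulin Resistance": 2}))
```

Under these assumptions, a term close to the citation marker in a highly weighted section contributes more to the cited document's feature vector, and a term spread evenly over many classes scores a higher entropy (and is therefore less informative) than one concentrated in a single class.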
