Acquisition and evaluation of verb subcategorization resources for biomedicine.

Affiliation

Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK.

Publication information

J Biomed Inform. 2013 Apr;46(2):228-37. doi: 10.1016/j.jbi.2013.01.001. Epub 2013 Jan 22.

Abstract

BACKGROUND

Biomedical natural language processing (NLP) applications that have access to detailed resources about the linguistic characteristics of biomedical language demonstrate improved performance on tasks such as relation extraction and syntactic or semantic parsing. Such applications are important for transforming the growing unstructured information buried in the biomedical literature into structured, actionable information. In this paper, we address the creation of linguistic resources that capture how individual biomedical verbs behave. We specifically consider verb subcategorization, or the tendency of verbs to "select" co-occurrence with particular phrase types, which influences the interpretation of verbs and identification of verbal arguments in context. There are currently a limited number of biomedical resources containing information about subcategorization frames (SCFs), and these are the result of either labor-intensive manual collation, or automatic methods that use tools adapted to a single biomedical subdomain. Either method may result in resources that lack coverage. Moreover, the quality of existing verb SCF resources for biomedicine is unknown, due to a lack of available gold standards for evaluation.

RESULTS

This paper presents three new resources related to verb subcategorization frames in biomedicine, and four experiments making use of the new resources. We present the first biomedical SCF gold standards, capturing two different but widely-used definitions of subcategorization, and a new SCF lexicon, BioCat, covering a large number of biomedical sub-domains. We evaluate the SCF acquisition methodologies for BioCat with respect to the gold standards, and compare the results with the accuracy of the only previously existing automatically-acquired SCF lexicon for biomedicine, the BioLexicon. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. Finally, we explore the definition of subcategorization using these resources and its implications for biomedical NLP. All resources are made publicly available.
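The precision versus coverage trade-off described here can be illustrated with a minimal sketch. It assumes type-level precision and recall over (verb, frame) pairs, a common choice for scoring SCF lexicons; the abstract does not state the exact metric used. The verbs, frame labels, and the two toy lexicons below are hypothetical and are not taken from BioCat, the BioLexicon, or the gold standards.

```python
from typing import Dict, Set, Tuple

# A lexicon maps each verb to the set of SCF labels recorded for it.
Lexicon = Dict[str, Set[str]]


def type_precision_recall(acquired: Lexicon, gold: Lexicon) -> Tuple[float, float, float]:
    """Score an acquired lexicon against a gold standard over (verb, frame) types.

    Only verbs present in the gold standard are scored (an assumption of this sketch).
    """
    tp = fp = fn = 0
    for verb, gold_frames in gold.items():
        found = acquired.get(verb, set())
        tp += len(found & gold_frames)   # frames correctly acquired
        fp += len(found - gold_frames)   # acquired frames absent from the gold standard
        fn += len(gold_frames - found)   # gold frames the lexicon missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data, for illustration only.
gold = {"inhibit": {"NP", "NP-PP"}, "bind": {"NP", "PP"}}
conservative = {"inhibit": {"NP"}, "bind": {"NP"}}
broad = {"inhibit": {"NP", "NP-PP", "SCOMP"}, "bind": {"NP", "PP", "NP-PP"}}

for name, lexicon in [("conservative", conservative), ("broad", broad)]:
    p, r, f = type_precision_recall(lexicon, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy setting the conservative lexicon lists few frames and gets all of them right (high precision, low recall), while the broad lexicon recovers every gold frame at the cost of some spurious ones (lower precision, full recall). Swapping in a gold standard built under a different definition of subcategorization would change the scores for the same lexicon.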

CONCLUSION

The SCF resources we have evaluated still show considerably lower accuracy than that reported with general English lexicons, demonstrating the need for domain- and subdomain-specific SCF acquisition tools for biomedicine. Our new gold standards reveal major differences when annotators use the different definitions. Moreover, evaluation of BioCat yields major differences in accuracy depending on the gold standard, demonstrating that the definition of subcategorization adopted will have a direct impact on perceived system accuracy for specific tasks.

