Building a text corpus for representing the variety of medical language.

Authors

Zweigenbaum P, Jacquemart P, Grabar N, Habert B

Affiliation

DIAM Service d'Informatique Médicale, Assistance Publique, Hôpitaux de Paris, Département de Biomathématiques, Université Paris 75634 Paris Cedex 13, France.

Publication information

Stud Health Technol Inform. 2001;84(Pt 1):290-4.

Abstract

Medical language processing has focused until recently on a few types of textual documents. However, a much larger variety of document types is used in different settings. It has been shown that Natural Language Processing (NLP) tools can exhibit very different behavior on different types of texts. Without better-informed knowledge about the differential performance of NLP tools on a variety of medical text types, it will be difficult to control the extension of their application to different medical documents. We endeavored to provide a basis for such informed assessment: the construction of a large corpus of medical text samples. We propose a framework for designing such a corpus: a set of descriptive dimensions and a standardized encoding of both meta-information (implementing these dimensions) and content. We present a proof-of-concept demonstration by encoding an initial corpus of text samples according to these principles.
