Zweigenbaum P, Jacquemart P, Grabar N, Habert B
DIAM - Service d'Informatique Médicale, Assistance Publique - Hôpitaux de Paris, Département de Biomathématiques, Université Paris, 75634 Paris Cedex 13, France.
Stud Health Technol Inform. 2001;84(Pt 1):290-4.
Medical language processing has until recently focused on a few types of textual documents. However, a much larger variety of document types is used in different settings. It has been shown that Natural Language Processing (NLP) tools can exhibit very different behavior on different types of text. Without better-informed knowledge of the differential performance of NLP tools on a variety of medical text types, it will be difficult to control the extension of their application to different medical documents. We endeavored to provide a basis for such an informed assessment: the construction of a large corpus of medical text samples. We propose a framework for designing such a corpus: a set of descriptive dimensions and a standardized encoding of both meta-information (implementing these dimensions) and content. We present a proof-of-concept demonstration by encoding an initial corpus of text samples according to these principles.