Common data model for natural language processing based on two existing standard information models: CDA+GrAF.

Affiliation

Department of Biomedical Informatics, University of Utah, School of Medicine, Salt Lake City, UT 84112, USA.

Publication information

J Biomed Inform. 2012 Aug;45(4):703-10. doi: 10.1016/j.jbi.2011.11.018. Epub 2011 Dec 8.

DOI:10.1016/j.jbi.2011.11.018

PMID:22197801

Abstract

A growing need for collaboration and resource sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We combined two existing standards, the HL7 Clinical Document Architecture (CDA) and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model, called "CDA+GrAF". We experimented with several methods of combining these standards and eventually selected one that wraps separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create example standoff annotation documents, which were successfully validated against the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF and used it to generate 50 annotation documents, all of which validated successfully. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and we plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may make it easier to compare NLP tools and applications directly, combine their output, transform and "translate" annotations between different NLP applications, and eventually "plug and play" different modules in NLP applications.
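To make the idea of standoff annotation concrete, the following is a minimal Python sketch of turning one concept annotation in the 2010 i2b2/VA Challenge style into a GrAF-like XML fragment that points into the text by character offsets. It is not the authors' translation tool. The assumed input layout (`c="text" line:token line:token||t="type"`, with 1-based line numbers and 0-based whitespace-delimited tokens), the function names `token_span_to_offsets` and `concept_to_standoff`, and the simplified element names (`graph`, `region`, `node`, `link`, `a`, `fs`, `f`) are illustrative assumptions. The output is not validated against the ISO GrAF schema, and the CDA part of the wrapper document is omitted.

```python
import re
import xml.etree.ElementTree as ET

# Assumed i2b2-style concept line: c="text" startLine:startTok endLine:endTok||t="type"
CONCEPT_RE = re.compile(
    r'^c="(?P<text>.*)" (?P<sl>\d+):(?P<st>\d+) (?P<el>\d+):(?P<et>\d+)'
    r'\|\|t="(?P<type>[^"]+)"$'
)


def token_span_to_offsets(doc_text, sl, st, el, et):
    """Map (1-based line, 0-based token) positions to character offsets."""
    lines = doc_text.split("\n")
    line_starts, pos = [], 0
    for line in lines:
        line_starts.append(pos)
        pos += len(line) + 1  # +1 for the newline removed by split

    def token_bounds(line_no, tok_no):
        tokens = list(re.finditer(r"\S+", lines[line_no - 1]))
        base = line_starts[line_no - 1]
        return base + tokens[tok_no].start(), base + tokens[tok_no].end()

    start, _ = token_bounds(sl, st)
    _, end = token_bounds(el, et)
    return start, end


def concept_to_standoff(doc_text, concept_line, idx=0):
    """Build a simplified, GrAF-like standoff fragment for one concept."""
    m = CONCEPT_RE.match(concept_line.strip())
    if m is None:
        raise ValueError(f"unrecognized concept line: {concept_line!r}")
    start, end = token_span_to_offsets(
        doc_text, int(m["sl"]), int(m["st"]), int(m["el"]), int(m["et"])
    )
    graph = ET.Element("graph")
    # The region anchors the annotation in the text; the text itself is not copied into the markup.
    ET.SubElement(graph, "region", {"xml:id": f"r{idx}", "anchors": f"{start} {end}"})
    node = ET.SubElement(graph, "node", {"xml:id": f"n{idx}"})
    ET.SubElement(node, "link", {"targets": f"r{idx}"})
    # The annotation carries the concept type as its label.
    ann = ET.SubElement(graph, "a", {"label": m["type"], "ref": f"n{idx}"})
    fs = ET.SubElement(ann, "fs")
    ET.SubElement(fs, "f", {"name": "covered_text", "value": doc_text[start:end]})
    return graph


if __name__ == "__main__":
    text = "Admission Date :\nThe patient had a workup for chest pain ."
    fragment = concept_to_standoff(text, 'c="chest pain" 2:6 2:7||t="problem"')
    print(ET.tostring(fragment, encoding="unicode"))
```

Because the annotation only stores offsets, the source text stays untouched, and several tools can annotate the same document without interfering with one another.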
