Taming variability in free text: application to health surveillance.

Author information

Shapiro Alan R

Affiliation

Department of Medicine, New York University School of Medicine, 5 Pheasant Run, Pleasantville, NY 10570, USA.

Publication information

MMWR Suppl. 2004 Sep 24;53:95-100.

PMID:15714636

Abstract

INTRODUCTION

Use of free text in syndromic surveillance requires managing the substantial word variation that results from use of synonyms, abbreviations, acronyms, truncations, concatenations, misspellings, and typographic errors. Failure to detect these variations results in missed cases, and traditional methods for capturing these variations require ongoing, labor-intensive maintenance.

OBJECTIVES

This paper examines the problem of word variation in chief-complaint data and explores three semi-automated approaches for addressing it.

METHODS

Approximately 6 million chief complaints from patients reporting to emergency departments at 54 hospitals were analyzed. A method of text normalization that models the similarities between words was developed to manage the linguistic variability in chief complaints. Three approaches based on this method were investigated: 1) automated correction of spelling and typographical errors; 2) use of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes to select chief complaints to mine for overlooked vocabulary; and 3) identification of overlooked vocabulary by matching words that appeared in similar contexts.
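The first approach, automated correction of spelling and typographical errors, can be illustrated with a minimal sketch. This is not the normalization method developed in the paper, whose details the abstract does not give; it substitutes generic string-similarity matching from Python's standard `difflib` module. The vocabulary list, the function names, and the 0.75 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical reference vocabulary; a real system would use a much larger,
# curated list of chief-complaint terms.
VOCABULARY = ["diarrhea", "nausea", "vomiting", "fever", "cough"]


def normalize_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Map a possibly misspelled word to its closest vocabulary entry.

    The word is lowercased first. If no entry reaches the similarity
    cutoff (an assumed threshold), the lowercased word is returned as-is.
    """
    word = word.lower()
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalize_complaint(text):
    """Normalize every whitespace-separated word in a chief complaint."""
    return " ".join(normalize_word(w) for w in text.split())


if __name__ == "__main__":
    print(normalize_complaint("Diarhea and vomitting"))
```

A fixed similarity cutoff is the simplest possible design; it would likely need tuning against real chief-complaint data, because short abbreviations and acronyms can sit close to unrelated words.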

RESULTS

The prevalence of word errors was high. For example, such words as diarrhea, nausea, and vomiting were misspelled 11.0%-18.8% of the time. Approximately 20% of all words were abbreviations or acronyms whose use varied substantially by site. Two methods, use of ICD-9-CM codes to focus searches and the automated pairing of words by context, both retrieved relevant but previously unexpected words. Text normalization simultaneously reduced the number of false positives and false negatives in syndrome classification, compared with commonly used methods based on word stems. In approximately 25% of instances, using text normalization to detect lower respiratory syndrome would have improved the sensitivity of current word-stem approaches by approximately 10%-20%.
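To see how a similarity-based match could reduce false positives and false negatives at the same time relative to a word-stem match, consider the toy comparison below. It is only a sketch: the stem `pneu`, the single target term, the 0.8 cutoff, and the treatment of "pneumothorax" as an unwanted match are illustrative assumptions, and `difflib` similarity stands in for the paper's normalization method, which the abstract does not specify.

```python
import difflib

STEM = "pneu"                  # hypothetical word stem used by a stem matcher
TARGET_TERMS = ["pneumonia"]   # hypothetical normalized target vocabulary


def stem_match(complaint):
    """Flag a complaint if any word begins with the stem."""
    return any(w.startswith(STEM) for w in complaint.lower().split())


def normalized_match(complaint, cutoff=0.8):
    """Flag a complaint if any word is sufficiently similar to a target term.

    The cutoff is an assumed threshold, not a value from the paper.
    """
    return any(
        difflib.get_close_matches(w, TARGET_TERMS, n=1, cutoff=cutoff)
        for w in complaint.lower().split()
    )


if __name__ == "__main__":
    for complaint in ["pnuemonia", "pneumothorax", "pneumonia"]:
        print(complaint, stem_match(complaint), normalized_match(complaint))
```

In this toy case the stem matcher misses the transposed spelling "pnuemonia" and flags "pneumothorax", while the similarity matcher does the reverse on both.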

CONCLUSIONS

Incomplete vocabulary and word errors can have a substantial impact on the retrieval performance of free-text syndromic surveillance systems. The text normalization methods described in this paper can reduce the effects of these problems.
