Information extraction approaches to unconventional data sources for "Injury Surveillance System": the case of newspapers clippings.

Affiliation

Department of Public Health and Microbiology, University of Torino, Torino, Italy.

Publication information

J Med Syst. 2012 Apr;36(2):475-81. doi: 10.1007/s10916-010-9492-1. Epub 2010 Apr 27.

PMID:20703703

Abstract

Injury Surveillance Systems based on traditional hospital records or clinical data have the advantage of being a well-established, highly reliable source of information for active surveillance of specific injuries, such as choking in children. However, they suffer from delays in making data available for analysis, due to inefficiencies in data collection procedures. Integrating clinical registries with unconventional data sources such as newspaper articles therefore has the advantage of making the system more useful for early alerting. Using such sources is difficult, since the information is available only as free natural-language documents rather than the structured databases required by traditional data mining techniques. Information Extraction (IE) addresses the problem of transforming a corpus of textual documents into a more structured database. In this paper, on a corpus of Italian newspaper articles related to choking in children due to ingestion/inhalation of a foreign body, we compared the performance of three IE algorithms: (a) a classical rule-based system which requires manual annotation of the rules; (b) a rule-based system which allows for the automatic building of rules; (c) a machine learning method based on Support Vector Machines. Although some useful indications are extracted from the newspaper clippings, this approach is at present far from being routinely implemented for injury surveillance purposes.
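To make the idea of turning free text into a structured record concrete, the following is a minimal sketch of a hand-written, rule-based extractor in the spirit of approach (a). It is not the system evaluated in the paper: the rules, the field names (`age_years`, `foreign_body`, `mechanism`), the small object gazetteer, and the English example sentences are all assumptions made for illustration, whereas the study worked on Italian articles.

```python
import re
from typing import Optional, TypedDict


class InjuryRecord(TypedDict):
    """Structured row produced from one free-text clipping (hypothetical schema)."""
    age_years: Optional[int]
    foreign_body: Optional[str]
    mechanism: Optional[str]


# Hand-written rules; all patterns below are illustrative assumptions.
AGE_RULE = re.compile(r"\b(\d{1,2})[- ]year[- ]old\b", re.IGNORECASE)
MECHANISM_RULE = re.compile(r"\b(swallow|ingest|inhal)\w*", re.IGNORECASE)
OBJECT_RULE = re.compile(r"\b(coin|peanut|grape|button battery|toy)\b", re.IGNORECASE)

# Map matched verb stems to a normalised mechanism label.
MECHANISM_LABELS = {
    "swallow": "ingestion",
    "ingest": "ingestion",
    "inhal": "inhalation",
}


def extract(text: str) -> InjuryRecord:
    """Apply each rule once and fill the fields it matches; unmatched fields stay None."""
    age = AGE_RULE.search(text)
    mech = MECHANISM_RULE.search(text)
    obj = OBJECT_RULE.search(text)
    return {
        "age_years": int(age.group(1)) if age else None,
        "foreign_body": obj.group(1).lower() if obj else None,
        "mechanism": MECHANISM_LABELS[mech.group(1).lower()] if mech else None,
    }


if __name__ == "__main__":
    clippings = [
        "A 3-year-old boy was hospitalised after swallowing a coin.",
        "Girl, aged two, inhaled a peanut during lunch.",
    ]
    for clipping in clippings:
        print(extract(clipping))
```

The second clipping shows the typical weakness of manually written rules: the age is expressed in a form the age rule does not cover, so that field is left empty. Approaches that build rules automatically or learn from annotated examples, such as (b) and (c), are meant to reduce this kind of manual effort.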
