National Institute of Informatics, Chiyoda-ku, Tokyo, Japan.
BMC Med Inform Decis Mak. 2010 Jan 12;10:1. doi: 10.1186/1472-6947-10-1.
Current public concern over the spread of infectious diseases has underscored the importance of health surveillance systems for the speedy detection of disease outbreaks. Several international report-based monitoring systems have been developed, including GPHIN, Argus, HealthMap, and BioCaster. A vital feature of these report-based systems is the geo-temporal encoding of outbreak-related textual data. Until now, automated systems have tended to use an ad-hoc strategy for processing geo-temporal information, normally involving the detection of locations that match pre-determined criteria, and the use of document publication dates as a proxy for disease event dates. Although these strategies appear to be effective enough for reporting events at the country and province levels, they may be less effective at discovering geo-temporal information at more detailed levels of granularity. In order to improve the capabilities of current Web-based health surveillance systems, we introduce the design for a novel scheme called spatiotemporal zoning.
The proposed scheme classifies news articles into zones according to the spatiotemporal characteristics of their content. To study the reliability of the annotation scheme, we analyzed inter-annotator agreement among a group of human annotators on over 1000 reported events. The results were evaluated both qualitatively and quantitatively, using kappa and percentage agreement.
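As an illustration of the two agreement measures named above, the following is a minimal sketch for the two-annotator case. It assumes Cohen's kappa; the abstract does not state which kappa variant was used, and the function names and example labels are hypothetical, not taken from the study.

```python
from collections import Counter


def percentage_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently with their own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical event-type labels from two annotators (illustration only).
annotator_1 = ["outbreak", "outbreak", "other", "other", "outbreak", "other"]
annotator_2 = ["outbreak", "other", "other", "other", "outbreak", "other"]

agreement = percentage_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on five of six items, but kappa is lower than the raw agreement because half of that agreement would be expected by chance. This is why the two measures are reported side by side.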
The reliability evaluation of our scheme yielded very promising inter-annotator agreement: a kappa above 0.9 and a percentage agreement above 0.9 for event type annotation and temporal attribute annotation, respectively, with slightly lower agreement for the spatial attribute. However, for events indicating an outbreak situation, the annotators usually agreed on the location at the lowest level of granularity.
We developed and evaluated a novel spatiotemporal zoning annotation scheme. The results of the scheme evaluation indicate that our annotated corpus and the proposed annotation scheme are reliable and could be effectively used for developing an automatic system. Given the current advances in natural language processing techniques, including the availability of language resources and tools, we believe that a reliable automatic spatiotemporal zoning system can be achieved. In the next stage of this work, we plan to develop an automatic zoning system and evaluate its usability within an operational health surveillance system.