An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics.

Affiliation

The ISIS Center, Georgetown University Medical Center, Washington, DC, USA.

Publication information

Int J Med Inform. 2011 Jan;80(1):56-66. doi: 10.1016/j.ijmedinf.2010.10.015. Epub 2010 Dec 4.

DOI:10.1016/j.ijmedinf.2010.10.015

PMID:21134784

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3904285/

Abstract

PURPOSE

Early detection of infectious disease outbreaks is crucial to protecting public health. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated the automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, preparing a training data set for classifiers is expensive, because it usually consists of manually labeled relevant and irrelevant articles. To mitigate this cost, we examined training on labeled relevant articles together with randomly sampled unlabeled articles.

METHODS

Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant articles and 149 or more randomly sampled unlabeled articles. A diverse set of classifiers was obtained by varying the number of sampled unlabeled articles and the number of word features. The trained classifiers were applied to 15,000 articles published over 15 days. The top-ranked articles from each classifier were pooled, and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.
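As an illustration only, the training and pooling steps can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes a whitespace tokenizer, a multinomial naïve Bayes model with add-one smoothing, and unlabeled articles treated as the negative class. It omits word-feature selection and the SVM, and the function names (train_nb, score, pool_top_ranked) and the toy articles are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for word features.
    return text.lower().split()

def train_nb(relevant_docs, unlabeled_docs):
    """Multinomial naive Bayes; unlabeled docs are treated as the negative class."""
    counts = {1: Counter(), 0: Counter()}
    for doc in relevant_docs:
        counts[1].update(tokenize(doc))
    for doc in unlabeled_docs:
        counts[0].update(tokenize(doc))
    vocab = set(counts[0]) | set(counts[1])
    n_docs = len(relevant_docs) + len(unlabeled_docs)
    prior = {1: len(relevant_docs) / n_docs, 0: len(unlabeled_docs) / n_docs}
    return counts, vocab, prior

def score(model, text):
    """Log-odds that an article is relevant (higher means more relevant)."""
    counts, vocab, prior = model
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    log_odds = math.log(prior[1]) - math.log(prior[0])
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # ignore words never seen in training
        p1 = (counts[1][tok] + 1) / (totals[1] + len(vocab))  # add-one smoothing
        p0 = (counts[0][tok] + 1) / (totals[0] + len(vocab))
        log_odds += math.log(p1) - math.log(p0)
    return log_odds

def pool_top_ranked(models, articles, k):
    """Union of the indices of the top-k articles from each classifier."""
    pooled = set()
    for model in models:
        ranked = sorted(range(len(articles)),
                        key=lambda i: score(model, articles[i]), reverse=True)
        pooled.update(ranked[:k])
    return sorted(pooled)

# Hypothetical toy data; the study used 149 relevant articles.
relevant = ["cholera outbreak reported in coastal district",
            "officials confirm avian influenza outbreak in poultry farms"]
unlabeled = ["stock markets rally after earnings report",
             "local team wins championship game",
             "health ministry reports new cholera cases"]
new_articles = ["new outbreak of cholera confirmed",
                "championship game ends in draw",
                "markets fall after report"]

# Two classifiers that differ in the number of sampled unlabeled articles.
models = [train_nb(relevant, unlabeled[:2]), train_nb(relevant, unlabeled)]
pooled = pool_top_ranked(models, new_articles, k=1)
```

Only the pooled articles would then go to the analyst for review, which is what keeps the labeling effort small relative to the full stream.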

RESULTS

Daily averages of the area under the ROC curve (AUC) over the 15-day evaluation period were 0.841 and 0.836 for the naïve Bayes and SVM classifiers, respectively. We consulted a database of disease outbreak reports to confirm that the evaluation data set produced by the pooling method did indeed cover the incidents recorded in the database during the evaluation period.
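For readers unfamiliar with the metric, a daily-average AUC can be computed as sketched below. This is a hedged illustration rather than the authors' evaluation code: it assumes each day supplies classifier scores and binary analyst judgments (1 for relevant, 0 for irrelevant), and it uses the rank interpretation of AUC, in which ties count as half. The function names and the example scores are made up.

```python
def auc(scores, labels):
    """Probability that a random relevant article outranks a random irrelevant one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one relevant and one irrelevant article")
    # Count pairwise wins of relevant over irrelevant; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def daily_average_auc(days):
    """Mean of the per-day AUCs; each day is a (scores, labels) pair."""
    return sum(auc(scores, labels) for scores, labels in days) / len(days)

# Hypothetical two-day example (the study averaged over 15 days).
days = [
    ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),  # day 1: AUC = 3/4
    ([0.7, 0.6, 0.2], [1, 1, 0]),          # day 2: AUC = 1.0
]
average = daily_average_auc(days)  # (0.75 + 1.0) / 2 = 0.875
```

Averaging per-day AUCs, rather than computing one AUC over all days combined, gives each day equal weight regardless of how many articles were published on it.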

CONCLUSIONS

The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.

