An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics.

Author information

The ISIS Center, Georgetown University Medical Center, Washington, DC, USA.

Publication information

Int J Med Inform. 2011 Jan;80(1):56-66. doi: 10.1016/j.ijmedinf.2010.10.015. Epub 2010 Dec 4.

Abstract

PURPOSE

Early detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles as well as labeled relevant articles.

METHODS

Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant articles and 149 or more randomly sampled unlabeled articles. A range of classifiers was trained by varying the number of sampled unlabeled articles and the number of word features. The trained classifiers were applied to 15,000 articles published over 15 days. Top-ranked articles from each classifier were pooled, and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.
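The training and pooling steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the randomly sampled unlabeled articles are simply treated as the negative class, uses a plain multinomial naïve Bayes with add-one smoothing, and omits the SVM and the word-feature selection. The function names (`train_nb`, `score`, `pool_top_ranked`) and the toy articles are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; keeps alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def train_nb(relevant, unlabeled):
    """Multinomial naive Bayes with add-one smoothing.

    Assumption: unlabeled articles stand in for the negative class.
    """
    docs = {"pos": relevant, "neg": unlabeled}
    counts = {c: Counter(w for d in ds for w in tokenize(d))
              for c, ds in docs.items()}
    vocab = set(counts["pos"]) | set(counts["neg"])
    total_docs = len(relevant) + len(unlabeled)
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for c, ds in docs.items():
        model["prior"][c] = math.log(len(ds) / total_docs)
        denom = sum(counts[c].values()) + len(vocab)
        model["loglik"][c] = {w: math.log((counts[c][w] + 1) / denom)
                              for w in vocab}
    return model


def score(model, text):
    # Log-odds that the article is relevant; higher means more relevant.
    s = model["prior"]["pos"] - model["prior"]["neg"]
    for w in tokenize(text):
        if w in model["vocab"]:
            s += model["loglik"]["pos"][w] - model["loglik"]["neg"][w]
    return s


def pool_top_ranked(models, articles, k):
    # Union of the indices of the k top-ranked articles from each classifier.
    pooled = set()
    for m in models:
        ranked = sorted(range(len(articles)),
                        key=lambda i: score(m, articles[i]), reverse=True)
        pooled.update(ranked[:k])
    return sorted(pooled)


# Toy data (hypothetical): two relevant articles, two unlabeled ones.
relevant = ["cholera outbreak reported in the region",
            "new avian flu cases confirmed"]
unlabeled = ["stock market rallies today",
             "football team wins final match"]
model = train_nb(relevant, unlabeled)
pooled = pool_top_ranked([model], ["market rallies", "flu outbreak confirmed"], k=1)
```

In a pooling setup like this, only the articles in `pooled` would go to the human reviewer, which keeps the labeling effort far below the size of the full article stream.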

RESULTS

Daily averages of the area under the ROC curve (AUC) over the 15-day evaluation period were 0.841 and 0.836 for the naïve Bayes and SVM classifiers, respectively. We consulted a database of disease outbreak reports to confirm that the evaluation data set produced by the pooling method did cover the incidents recorded in the database during the evaluation period.
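For readers unfamiliar with the metric, the sketch below shows one way a daily-averaged AUC could be computed from classifier scores and expert labels. It is an illustration under assumptions, not the authors' evaluation code: AUC is computed by pairwise comparison (the probability that a relevant article outscores an irrelevant one, with ties counted as half), and the function names and the two-day toy data are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for relevant, 0 for irrelevant. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one relevant and one irrelevant article")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def daily_average_auc(days):
    # days: list of (scores, labels) pairs, one pair per evaluation day.
    values = [auc(scores, labels) for scores, labels in days]
    return sum(values) / len(values)


# Toy data (hypothetical): two evaluation days.
day1 = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
day2 = ([0.7, 0.2], [1, 0])
average = daily_average_auc([day1, day2])
```

Averaging per-day AUCs, rather than pooling all 15 days into one ranking, weights each day equally regardless of how many articles were published on it.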

CONCLUSIONS

The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.


