The ISIS Center, Georgetown University Medical Center, Washington, DC, USA.
Int J Med Inform. 2011 Jan;80(1):56-66. doi: 10.1016/j.ijmedinf.2010.10.015. Epub 2010 Dec 4.
Early detection of infectious disease outbreaks is crucial to protecting public health. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined training on labeled relevant articles together with randomly sampled unlabeled articles.
Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant articles and 149 or more randomly sampled unlabeled articles. A range of classifiers was trained by varying both the number of sampled unlabeled articles and the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled, and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.
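The training and pooling steps described above can be illustrated with a minimal sketch. It is not the study's implementation: it assumes a multinomial naïve Bayes model with Laplace smoothing, whitespace tokenization, frequency-based feature selection, and unlabeled articles standing in for the negative class. The names `train_nb`, `score`, and `pool_top_ranked` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens serve as word features.
    return text.lower().split()


def train_nb(relevant_docs, unlabeled_docs, vocab_size=None):
    """Train multinomial naive Bayes; unlabeled articles act as the negative class."""
    groups = ((1, relevant_docs), (0, unlabeled_docs))
    counts = {1: Counter(), 0: Counter()}
    for label, docs in groups:
        for doc in docs:
            counts[label].update(tokenize(doc))
    # Assumption: keep the most frequent words overall (None keeps every word).
    total = counts[1] + counts[0]
    vocab = [word for word, _ in total.most_common(vocab_size)]
    n_docs = len(relevant_docs) + len(unlabeled_docs)
    model = {"vocab": set(vocab), "prior": {}, "loglik": {}}
    for label, docs in groups:
        model["prior"][label] = math.log(len(docs) / n_docs)
        # Laplace (add-one) smoothing over the selected vocabulary.
        denom = sum(counts[label][w] for w in vocab) + len(vocab)
        model["loglik"][label] = {
            w: math.log((counts[label][w] + 1) / denom) for w in vocab
        }
    return model


def score(model, doc):
    """Return the log-odds of relevance; higher means more likely relevant."""
    s = model["prior"][1] - model["prior"][0]
    for w in tokenize(doc):
        if w in model["vocab"]:
            s += model["loglik"][1][w] - model["loglik"][0][w]
    return s


def pool_top_ranked(models, articles, k):
    """Union of the indices of the top-k articles under each classifier."""
    pooled = set()
    for m in models:
        ranked = sorted(
            range(len(articles)),
            key=lambda i: score(m, articles[i]),
            reverse=True,
        )
        pooled.update(ranked[:k])
    return sorted(pooled)
```

Varying `vocab_size` and the number of sampled unlabeled articles yields a family of classifiers, and pooling their top-ranked articles gives a single set small enough for manual review.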
Daily averages of the areas under the ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836 for the naïve Bayes and SVM classifiers, respectively. We referenced a database of disease outbreak reports to confirm that the evaluation data set produced by the pooling method indeed covered the incidents recorded in the database during the evaluation period.
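As an illustration of the metric, the sketch below computes an AUC with the rank-based (Mann-Whitney) formulation and then averages it over days. The function names and the input layout (one pair of score and label lists per day) are assumptions made for this example, not details taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve: the fraction of (relevant, irrelevant)
    pairs in which the relevant article scores higher; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one article of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def daily_average_auc(days):
    """Mean of per-day AUCs; `days` is a list of (scores, labels) pairs."""
    return sum(auc(scores, labels) for scores, labels in days) / len(days)
```

Averaging per-day AUCs weights each day equally regardless of how many articles were published that day, which differs from computing one AUC over all articles at once.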
The proposed text classification framework, which uses randomly sampled unlabeled articles, can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and articles in languages other than English.