Computer Science Department, Data Science Institute, Columbia University, New York, NY, USA.
Bureau of Communicable Disease, New York City Department of Health and Mental Hygiene, Queens, NY, USA.
J Am Med Inform Assoc. 2018 Dec 1;25(12):1586-1592. doi: 10.1093/jamia/ocx093.
We developed a system for the discovery of foodborne illness mentioned in online Yelp restaurant reviews using text classification. The system is used by the New York City Department of Health and Mental Hygiene (DOHMH) to monitor Yelp for foodborne illness complaints.
We built classifiers for 2 tasks: (1) determining if a review indicated a person experiencing foodborne illness and (2) determining if a review indicated multiple people experiencing foodborne illness. We first developed a prototype classifier in 2012 for both tasks using a small labeled dataset. Over years of system deployment, DOHMH epidemiologists labeled 13 526 reviews selected by this classifier. We used these biased data and a sample of complementary reviews in a principled bias-adjusted training scheme to develop significantly improved classifiers. Finally, we performed an error analysis of the best resulting classifiers.
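The abstract does not spell out the bias-adjusted training scheme, so the following is only a minimal sketch of one common way to correct for selection bias: inverse-probability weighting of a stratified sample. It assumes the labeled reviews fall into two strata (those selected by the prototype classifier and a complementary sample of the rest) and weights each labeled example by how many reviews in the full population it stands for. The function names, stratum names, and toy counts are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def inverse_probability_weights(strata, population_sizes):
    """Weight each labeled example by (stratum population) / (labeled in stratum).

    strata: stratum name for each labeled example (hypothetical names).
    population_sizes: total number of reviews in each stratum.
    """
    labeled_counts = Counter(strata)
    return [population_sizes[s] / labeled_counts[s] for s in strata]


def weighted_log_loss(labels, probs, weights):
    """Weighted average logistic loss, the quantity a weighted classifier minimizes."""
    total = sum(weights)
    loss = 0.0
    for y, p, w in zip(labels, probs, weights):
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / total


# Toy illustration with made-up counts: 3 labeled reviews come from the
# classifier-selected stratum (30 reviews overall) and 1 from the
# complementary stratum (970 reviews overall).
strata = ["selected", "selected", "selected", "complement"]
population_sizes = {"selected": 30, "complement": 970}
weights = inverse_probability_weights(strata, population_sizes)

labels = [1, 1, 0, 0]
uninformed_probs = [0.5, 0.5, 0.5, 0.5]
baseline_loss = weighted_log_loss(labels, uninformed_probs, weights)
```

In this sketch the single complementary example carries most of the weight because it stands in for the large pool of reviews the prototype never surfaced. Weights of this kind could be passed as per-example weights to a logistic regression trainer.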
We found that logistic regression trained with bias-adjusted augmented data performed best for both classification tasks, with F1-scores of 87% and 66% for tasks 1 and 2, respectively.
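For reference, the F1-score is the harmonic mean of precision and recall. The snippet below is a small illustration of that standard definition for binary labels; the label vectors are invented and do not reproduce the paper's evaluation data.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
example_f1 = f1_score(y_true, y_pred)
```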
Our error analysis revealed that the inability of our models to account for long phrases caused the most errors. Our bias-adjusted training scheme illustrates how to improve a classification system iteratively by exploiting available biased labeled data.
Our system has been instrumental in the identification of 10 outbreaks and 8523 complaints of foodborne illness associated with New York City restaurants since July 2012. Our evaluation has identified strong classifiers for both tasks, whose deployment will allow DOHMH epidemiologists to more effectively monitor Yelp for foodborne illness investigations.