Nanda Gaurav, Grattan Kathleen M, Chu MyDzung T, Davis Letitia K, Lehto Mark R
School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA.
Massachusetts Department of Public Health, 250 Washington Street, 4th Floor, Boston, MA 02108, USA.
J Safety Res. 2016 Jun;57:71-82. doi: 10.1016/j.jsr.2016.03.001. Epub 2016 Mar 15.
Studies on autocoding injury data have found that machine learning algorithms perform well for categories that occur frequently but often struggle with rare categories. Therefore, manual coding, although resource-intensive, cannot be eliminated. We propose a Bayesian decision support system to autocode a large portion of the data, filter cases for manual review, and assist human coders by presenting them with the top k prediction choices and a confusion matrix of predictions from Bayesian models.
We studied the prediction performance of Single-Word (SW) and Two-Word-Sequence (TW) Naïve Bayes models on a sample of data from the 2011 Survey of Occupational Injuries and Illnesses (SOII). We used agreement between the predictions of the SW and TW models, together with various prediction strength thresholds, to autocode cases and to filter cases for manual review. We also studied the sensitivity of the top k predictions of the SW model, the TW model, and the SW-TW combination, and then compared the accuracy of the codes manually assigned to the SOII data with that of the proposed system.
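The routing rule described above (autocode when the SW and TW models agree with enough prediction strength, otherwise send the case to manual review) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothing scheme, the use of the weaker of the two posterior probabilities as "prediction strength", the threshold values, the toy narratives, and every name (`NaiveBayes`, `route`, `sw_model`, `tw_model`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def single_words(text):
    """SW features: individual lower-cased words."""
    return text.lower().split()


def two_word_sequences(text):
    """TW features: adjacent word pairs."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, narratives, codes):
        for text, code in zip(narratives, codes):
            self.class_counts[code] += 1
            for feature in self.featurize(text):
                self.feature_counts[code][feature] += 1
                self.vocab.add(feature)
        return self

    def predict_proba(self, text):
        # Features never seen in training are ignored.
        features = [f for f in self.featurize(text) if f in self.vocab]
        total = sum(self.class_counts.values())
        log_scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.feature_counts[code].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in features:
                score += math.log((self.feature_counts[code][f] + 1) / denom)
            log_scores[code] = score
        # Normalize in a numerically stable way.
        peak = max(log_scores.values())
        unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}


def route(text, sw_model, tw_model, threshold):
    """Autocode only if both models agree and both are strong enough."""
    sw = sw_model.predict_proba(text)
    tw = tw_model.predict_proba(text)
    sw_code = max(sw, key=sw.get)
    tw_code = max(tw, key=tw.get)
    if sw_code == tw_code and min(sw[sw_code], tw[tw_code]) >= threshold:
        return ("autocode", sw_code)
    return ("manual_review", None)


# Toy narratives and codes, invented for illustration only.
train = [
    ("worker fell from ladder", "fall"),
    ("employee fell off ladder while painting", "fall"),
    ("slipped and fell on wet floor", "fall"),
    ("hand caught in machine", "caught"),
    ("finger caught in press machine", "caught"),
    ("arm caught in conveyor machine", "caught"),
]
texts, codes = zip(*train)
sw_model = NaiveBayes(single_words).fit(texts, codes)
tw_model = NaiveBayes(two_word_sequences).fit(texts, codes)

print(route("fell from ladder", sw_model, tw_model, threshold=0.7))
print(route("fell from ladder", sw_model, tw_model, threshold=0.9))
```

Raising the threshold in this sketch moves more cases into the manual review queue, which mirrors the trade-off between autocoding volume and accuracy that the thresholds are meant to control.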
Assuming well-trained coders review only the 26% of cases flagged for review, the accuracy of the proposed system was estimated at 86.5%, comparable to the accuracy of the original coding of the data set (range: 73%-86.8%). Overall, the TW model had higher sensitivity than the SW model, and prediction accuracy increased when the two models agreed and at higher prediction strength thresholds. The sensitivity of the top five predictions was 93%.
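As a rough illustration of the top k figure, sensitivity is read here as the share of cases whose correct code appears among a model's k highest-ranked predictions. That reading, the function names, and the probabilities below are assumptions made for the sketch; none of the numbers come from the study.

```python
def top_k(probabilities, k):
    """Return the k codes with the highest predicted probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:k]


def top_k_sensitivity(probability_rows, true_codes, k):
    """Fraction of cases whose true code is among the top k predictions."""
    hits = sum(
        true in top_k(row, k)
        for row, true in zip(probability_rows, true_codes)
    )
    return hits / len(true_codes)


# Hypothetical model outputs for three cases (not data from the study).
rows = [
    {"fall": 0.60, "caught": 0.30, "struck": 0.10},
    {"fall": 0.50, "caught": 0.40, "struck": 0.10},
    {"fall": 0.45, "caught": 0.35, "struck": 0.20},
]
truth = ["fall", "caught", "struck"]

for k in (1, 2, 3):
    print(k, top_k_sensitivity(rows, truth, k))
```

Showing a human coder the top k list rather than a single prediction is what lets the reviewer recover cases where the correct code is ranked second or lower.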
The proposed system seems promising for coding injury data because it offers comparable accuracy while requiring less manual coding.
Accurately and promptly coded occupational injury data are useful for surveillance as well as for prevention activities that aim to make workplaces safer.