Marucci-Wellman Helen R, Lehto Mark R, Corns Helen L
Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, Hopkinton, MA, USA.
School of Industrial Engineering, Purdue University, West Lafayette, IN, USA.
Accid Anal Prev. 2015 Nov;84:165-76. doi: 10.1016/j.aap.2015.06.014. Epub 2015 Sep 26.
Public health surveillance programs in the U.S. are undergoing landmark changes with the availability of electronic health records and advancements in information technology. Injury narratives gathered from hospital records, workers' compensation claims, or national surveys can be very useful for identifying antecedents to injury or emerging risks. However, classifying narratives manually can become prohibitively resource-intensive for large datasets. The purpose of this study was to develop a human-machine system that could be tailored with relative ease to routinely and accurately classify injury narratives from large administrative databases such as workers' compensation. We used a semi-automated approach based on two Naïve Bayesian algorithms to classify 15,000 workers' compensation narratives into two-digit Bureau of Labor Statistics (BLS) event (leading to injury) codes. Narratives were filtered out for manual review if the algorithms disagreed or made weak predictions. This approach resulted in an overall accuracy of 87%, with consistently high positive predictive values across all two-digit BLS event categories, including the very small categories (e.g., exposure to noise, needle sticks). The Naïve Bayes algorithms were able to identify and accurately machine code most narratives, leaving only 32% (4853) for manual review. This strategy substantially reduces the need for resources compared with manual review alone.
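The filtering rule described in the abstract (machine code a narrative only when both classifiers agree and neither prediction is weak) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the choice of a single-word model and a word-pair model as the two Naïve Bayes variants, add-one smoothing, the 0.8 confidence threshold, the names `NaiveBayes` and `route`, and the toy labels, which are not real BLS event codes.

```python
import math
from collections import Counter, defaultdict


def unigrams(text):
    """Split a narrative into lowercase single words."""
    return text.lower().split()


def bigrams(text):
    """Split a narrative into lowercase adjacent word pairs."""
    words = text.lower().split()
    return [a + " " + b for a, b in zip(words, words[1:])]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feature in self.featurize(text):
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)
        return self

    def predict(self, text):
        """Return (most probable label, its posterior probability)."""
        total_docs = sum(self.class_counts.values())
        # Features never seen in training are ignored.
        features = [f for f in self.featurize(text) if f in self.vocab]
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feature in features:
                score += math.log((counts[feature] + 1) / denom)
            log_scores[label] = score
        # Convert log scores to normalized posteriors in a stable way.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def route(narrative, word_model, pair_model, threshold=0.8):
    """Machine code only when both models agree and both are confident.

    The threshold value is a hypothetical choice, not one from the study.
    """
    label_a, p_a = word_model.predict(narrative)
    label_b, p_b = pair_model.predict(narrative)
    if label_a != label_b or min(p_a, p_b) < threshold:
        return ("manual_review", None)
    return ("machine_coded", label_a)


# Toy training data with made-up labels (not real BLS event codes).
TRAIN_TEXTS = [
    "slipped on wet floor and fell",
    "fell from ladder",
    "struck by falling box",
    "hit by forklift",
]
TRAIN_LABELS = ["fall", "fall", "struck", "struck"]

word_model = NaiveBayes(unigrams).fit(TRAIN_TEXTS, TRAIN_LABELS)
pair_model = NaiveBayes(bigrams).fit(TRAIN_TEXTS, TRAIN_LABELS)

if __name__ == "__main__":
    print(route("slipped on wet floor", word_model, pair_model))
```

In a setup like this, raising the threshold sends more narratives to human coders and lowering it sends fewer, so the threshold controls the trade-off between manual workload and machine-coding accuracy.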