Nahorniak Jasmine, Bovbjerg Viktor, Case Samantha, Kincl Laurel
College of Earth, Ocean, and Atmospheric Sciences, Oregon State University, 104 CEOAS Admin Bldg., Corvallis, OR, 97331, USA.
College of Public Health and Human Sciences, Oregon State University, 160 SW 26th St., Corvallis, OR, 97331, USA.
Inj Epidemiol. 2021 Jul 5;8(1):26. doi: 10.1186/s40621-021-00323-z.
Commercial fishing consistently has some of the highest workforce injury and fatality rates in the United States. Data on commercial fishing incidents are routinely collected by multiple organizations that do not currently coordinate or automatically link their data. Each data set has the potential to contribute to a more complete picture to inform prevention efforts. Our objective was to examine the utility of statistical data linkage methods for linking commercial fishing incident data when personally identifiable information is not available.
In this feasibility study, we identified true matches and discrepancies between de-identified data sets using the Python Record Linkage Toolkit. Four commercial fishing data sets from Oregon and Washington were linked: the Commercial Fishing Incident Database, the Vessel Casualty Database, the Nonfatal Injuries Database, and the Oregon Trauma Registry. The data sets covered different date ranges within 2000-2017 and contained 458, 524, 184, and 11 cases, respectively. Several data linkage classifiers were evaluated.
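As a rough illustration of the comparison step that precedes classification in record linkage, the sketch below compares every record in one de-identified set with every record in another and produces a binary agreement vector per pair. It uses only the Python standard library, not the Python Record Linkage Toolkit's own API. The records, the field names (`date`, `state`, `vessel_no`, `on_board`), and the exact-equality rule are hypothetical assumptions chosen for illustration; they are not the study's data or configuration.

```python
from itertools import product

# Hypothetical de-identified records; values and field names are made up.
set_a = [
    {"id": "A1", "date": "2012-03-04", "state": "OR", "vessel_no": "123456", "on_board": 4},
    {"id": "A2", "date": "2015-07-19", "state": "WA", "vessel_no": "654321", "on_board": 2},
]
set_b = [
    {"id": "B1", "date": "2012-03-04", "state": "OR", "vessel_no": "123456", "on_board": 4},
    {"id": "B2", "date": "2015-07-19", "state": "WA", "vessel_no": None, "on_board": 3},
]

FIELDS = ("date", "state", "vessel_no", "on_board")


def compare(rec_a, rec_b):
    """Return one 0/1 flag per field (1 = value present and exactly equal)."""
    return tuple(
        int(rec_a[f] is not None and rec_a[f] == rec_b[f]) for f in FIELDS
    )


def comparison_vectors(left, right):
    """Compare every left record with every right record (a full index)."""
    return {
        (a["id"], b["id"]): compare(a, b) for a, b in product(left, right)
    }


vectors = comparison_vectors(set_a, set_b)
for pair, vector in sorted(vectors.items()):
    print(pair, vector)
```

A full cross-product is workable here only because the data sets are small (hundreds of records each); larger sets would normally restrict candidate pairs first, for example by blocking on a shared field.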
The Naïve-Bayes classifier returned the highest number of true matches between these small data sets. A total of 41 true matches and 8 close matches were identified, of which 29 were determined to be duplicates. In addition, linkage highlighted 4 records that were not commercial fishing cases from Oregon and Washington. The optimum match parameters were the date, state, vessel official number, and number of people on board.
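To make the idea of a Naïve Bayes match classifier concrete, the following minimal sketch scores a binary agreement vector over the four match parameters named above. It is a generic textbook formulation written with the standard library only, not the toolkit's implementation or the study's fitted model. The per-field probabilities `M` and `U`, the prior `PRIOR`, and the field names are invented for illustration and were not estimated from the study data.

```python
import math

FIELDS = ("date", "state", "vessel_no", "on_board")

# Illustrative, made-up probabilities (NOT estimated from the study data):
# M[f] = P(field agrees | pair is a true match)
# U[f] = P(field agrees | pair is not a match)
M = {"date": 0.95, "state": 0.99, "vessel_no": 0.90, "on_board": 0.80}
U = {"date": 0.01, "state": 0.50, "vessel_no": 0.001, "on_board": 0.20}
PRIOR = 0.001  # assumed share of candidate pairs that are true matches


def match_probability(vector, m=M, u=U, prior=PRIOR):
    """Posterior P(match | agreement vector), assuming the fields are
    conditionally independent given match status (the naive Bayes assumption)."""
    log_odds = math.log(prior / (1 - prior))
    for field, agrees in zip(FIELDS, vector):
        if agrees:
            log_odds += math.log(m[field] / u[field])
        else:
            log_odds += math.log((1 - m[field]) / (1 - u[field]))
    # Convert log-odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))


full = match_probability((1, 1, 1, 1))     # all four fields agree
partial = match_probability((1, 1, 0, 0))  # only date and state agree
none = match_probability((0, 0, 0, 0))     # no field agrees
print(full, partial, none)
```

Under these assumed values, agreement on a rarely coincident field such as the vessel official number moves the score far more than agreement on state, which is one plausible reason a small set of well-chosen fields can separate matches from non-matches.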
Statistical data linkage enables accurate, routine matching for small de-identified injury and fatality data sets such as those in commercial fishing. It provides information needed to improve the accuracy of existing data records. It also enables expanding and sharpening details of individual incidents in support of occupational safety research.