Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, United States.
Data Science and Engineering Research Group, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, United States.
Ann Work Expo Health. 2023 Jul 6;67(6):772-783. doi: 10.1093/annweh/wxad020.
OBJECTIVES: Computer-assisted coding of job descriptions to standardized occupational classification codes facilitates evaluating occupational risk factors in epidemiologic studies by reducing the number of jobs needing expert coding. We evaluated the accuracy of the second version of SOCcer, a computerized algorithm designed to code free-text job descriptions to the US SOC-2010 system based on free-text job titles and work tasks.
METHODS: SOCcer v2 was updated by expanding the training data to include jobs from several epidemiologic studies and revising the algorithm to account for nonlinearity and incorporate interactions. We evaluated the agreement between codes assigned by experts and the highest-scoring code from SOCcer v1 and v2 (the score is a measure of confidence in the algorithm-predicted assignment) in 14,714 jobs from three epidemiologic studies. We also linked exposure estimates for 258 agents in the job-exposure matrix CANJEM to the expert-assigned and SOCcer v2-assigned codes and compared those estimates using kappa and intraclass correlation coefficients (ICCs). Analyses were stratified by SOCcer score, score distance between the top two scoring codes from SOCcer, and features from CANJEM.
RESULTS: SOCcer v2 agreement at the 6-digit level was 50%, compared to 44% for v1, and was similar for the three studies (38%-45%). Overall agreement for v2 at the 2-, 3-, and 5-digit levels was 73%, 63%, and 56%, respectively. For v2, median ICCs for the probability and intensity metrics were 0.67 (IQR 0.59-0.74) and 0.56 (IQR 0.50-0.60), respectively. The agreement between the expert-assigned and SOCcer-assigned codes increased linearly with SOCcer score. The agreement also improved when the top two scoring codes had larger differences in score.
CONCLUSIONS: Overall agreement between experts and SOCcer v2, applied to job descriptions from North American epidemiologic studies, was similar to the agreement usually observed between two experts. SOCcer's score predicted agreement with experts and can be used to prioritize jobs for expert review.
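The digit-level agreement and score-based prioritization described above can be illustrated with a minimal Python sketch. This is not SOCcer's code: the function names (`soc_prefix`, `agreement`, `needs_review`), the example jobs, the scores, and the 0.5 review threshold are all hypothetical, and the sketch assumes that agreement at the n-digit level means the first n digits of the two SOC-2010 codes match.

```python
from typing import List, Sequence


def soc_prefix(code: str, digits: int) -> str:
    """Return the first `digits` digits of a SOC code such as '29-1141'."""
    return code.replace("-", "")[:digits]


def agreement(expert: Sequence[str], predicted: Sequence[str], digits: int) -> float:
    """Fraction of jobs whose expert and predicted codes share the first `digits` digits."""
    if len(expert) != len(predicted) or not expert:
        raise ValueError("need two non-empty code lists of equal length")
    matches = sum(
        soc_prefix(e, digits) == soc_prefix(p, digits)
        for e, p in zip(expert, predicted)
    )
    return matches / len(expert)


def needs_review(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of jobs whose top score falls below an (assumed) review threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]


# Made-up illustrative data, not taken from the study.
expert = ["29-1141", "47-2031", "11-1021", "53-3032"]
predicted = ["29-1141", "47-2061", "11-9199", "53-3033"]
scores = [0.92, 0.35, 0.10, 0.60]

for level in (2, 3, 5, 6):
    print(f"{level}-digit agreement: {agreement(expert, predicted, level):.2f}")

# Low-scoring jobs are flagged for expert review; 0.5 is an arbitrary cutoff.
print("flag for review:", needs_review(scores, threshold=0.5))
```

In practice the threshold would be chosen from the observed relationship between score and agreement in the study at hand, trading expert workload against expected coding error.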