Beware the Grizzlyman: A comparison of job- and industry-based noise exposure estimates using manual coding and the NIOSH NIOCCS machine learning algorithm.

Author information

Cardno ChemRisk, Chicago, Illinois.

Department of Environmental Health Sciences, University of Michigan School of Public Health, Ann Arbor, Michigan.

Publication information

J Occup Environ Hyg. 2022 Jul;19(7):437-447. doi: 10.1080/15459624.2022.2076860. Epub 2022 Jun 7.

DOI:10.1080/15459624.2022.2076860

PMID:35537195

Abstract

Recently, the National Institute for Occupational Safety and Health (NIOSH) released an updated version of the NIOSH Industry and Occupation Computerized Coding System (NIOCCS), which uses supervised machine learning to assign industry and occupation codes based on provided free-text information. However, no efforts have been made to externally verify the quality of the assigned industry and occupation codes when the algorithm is provided with inputs of varying quality. This study sought to evaluate whether the NIOCCS algorithm was sufficiently robust with low-quality inputs and how variable input quality could impact subsequent job-level exposure estimates in a large job-exposure matrix for noise (NoiseJEM). Using free-text industry and job descriptions from >700,000 noise measurements in the NoiseJEM, three files were created and input into NIOCCS: (1) N1, "raw" industries and job titles; (2) N2, "refined" industries and "raw" job titles; and (3) N3, "refined" industries and job titles. Standardized industry and occupation codes were output by NIOCCS. Descriptive statistics of performance metrics (e.g., misclassification/discordance of occupation codes) were evaluated for each input relative to the original NoiseJEM dataset (N0). Across major Standard Occupational Classification (SOC) groups, total discordance rates for N1, N2, and N3 compared to N0 were 53.6%, 42.3%, and 5.0%, respectively. The impact of discordance varied by major SOC group and included both over- and under-estimates of average noise exposure compared to N0. N2 had the most accurate noise exposure estimates (i.e., smallest bias) across major SOC groups compared to N1 and N3. Further refinement of job titles in N3 showed little improvement. Some variation in classification efficacy was seen over time, particularly prior to 1985. Machine learning algorithms can systematically and consistently classify data but are highly dependent on the quality and amount of input data. The greatest benefit for an end-user may come from cleaning industry information before applying this method for job classification. Our results highlight the need for standardized classification methods that remain constant over time.
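
The two performance metrics named in the abstract, discordance of occupation codes and bias in average noise exposure relative to N0, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy SOC codes, and the noise levels are hypothetical, and taking an arithmetic mean of dBA values is a simplifying assumption rather than the study's averaging method.

```python
from collections import defaultdict
from statistics import mean


def major_group(soc_code):
    """Return the 2-digit SOC major group, e.g. '47-2061' -> '47'."""
    return soc_code.split("-")[0]


def discordance_rate(reference, candidate):
    """Fraction of records whose major SOC group differs from the reference."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    mismatches = sum(
        major_group(r) != major_group(c) for r, c in zip(reference, candidate)
    )
    return mismatches / len(reference)


def mean_bias_by_group(reference, candidate, levels):
    """Candidate-minus-reference mean level for each major SOC group.

    Each measurement is assigned to a group once by the reference coding and
    once by the candidate coding; groups missing from either are skipped.
    """
    ref_levels = defaultdict(list)
    cand_levels = defaultdict(list)
    for r, c, level in zip(reference, candidate, levels):
        ref_levels[major_group(r)].append(level)
        cand_levels[major_group(c)].append(level)
    return {
        group: mean(cand_levels[group]) - mean(ref_levels[group])
        for group in ref_levels
        if group in cand_levels
    }


# Hypothetical toy data: manual codes (N0-like) vs. algorithm codes (N1-like).
n0_codes = ["47-2061", "47-2031", "51-4121", "11-9021"]
n1_codes = ["47-2061", "51-4121", "51-4121", "47-2061"]
noise_levels = [90.0, 88.0, 92.0, 80.0]  # illustrative values only

rate = discordance_rate(n0_codes, n1_codes)
bias = mean_bias_by_group(n0_codes, n1_codes, noise_levels)
```

A negative bias for a group means the candidate coding under-estimates that group's average exposure relative to the reference coding, and a positive bias means it over-estimates it.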
