Kim Tae-Yeon, Baek Seong-Uk, Lim Myeong-Hun, Yun Byungyoon, Paek Domyung, Zoh Kyung Ehi, Youn Kanwoo, Lee Yun Keun, Kim Yangho, Kim Jungwon, Choi Eunsuk, Kang Mo-Yeol, Cho YoonHo, Lee Kyung-Eun, Sim Juho, Oh Juyeon, Park Heejoo, Lee Jian, Won Jong-Uk, Lee Yu-Min, Yoon Jin-Ha
Department of Occupational and Environmental Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea.
The Institute for Occupational Health, Yonsei University College of Medicine, Seoul, Korea.
Ann Occup Environ Med. 2024 Aug 6;36:e19. doi: 10.35371/aoem.2024.36.e19. eCollection 2024.
Accurate occupation classification is essential in various fields, including policy development and epidemiological studies. This study aims to develop an occupation classification model based on DistilKoBERT.
This study used data from the 5th and 6th Korean Working Conditions Surveys, conducted in 2017 and 2020, respectively. A total of 99,665 survey participants, who were nationally representative of Korean workers, were included. We used participants' natural language responses describing their job responsibilities, together with their occupational codes based on the Korean Standard Classification of Occupations (7th version, 3-digit codes). The dataset was randomly split into training and test datasets in a ratio of 7:3. The occupation classification model based on DistilKoBERT was fine-tuned on the training dataset and evaluated on the test dataset. Accuracy, precision, recall, and F1 score were calculated as evaluation metrics.
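As an illustration of the random 7:3 split described above, the following is a minimal sketch rather than the authors' actual procedure. The function name `split_train_test`, the seed, and the toy records are hypothetical; the abstract does not state how the split was implemented or whether it was stratified by occupational code.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists.

    Assumption: a simple unstratified random split; the seed is arbitrary.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records: (free-text job description, 3-digit occupation code).
records = [(f"job description {i}", str(100 + i % 5)) for i in range(10)]
train, test = split_train_test(records)
print(len(train), len(test))
```

With rare occupational codes, an unstratified split like this one can leave some codes with very few test examples, which is one reason a stratified split is sometimes preferred.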
The final model, which classified 28,996 survey participants in the test dataset into 142 occupational codes, exhibited an accuracy of 84.44%. For the evaluation metrics, the precision, recall, and F1 score of the model, calculated by weighting based on the sample size, were 0.83, 0.84, and 0.83, respectively. The model demonstrated high precision in the classification of service and sales workers yet exhibited low precision in the classification of managers. In addition, it displayed high precision in classifying occupations prominently represented in the training dataset.
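To make "calculated by weighting based on the sample size" concrete, the sketch below computes per-class precision, recall, and F1 score and then averages them with weights proportional to each class's number of true examples. It is an illustrative implementation under that assumption, not the authors' code; the function name `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)        # true examples per class
    predicted = Counter(y_pred)      # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Per-class scores; precision is 0 if the class is never predicted.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n           # weight by share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels standing in for 3-digit occupation codes.
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

Under this weighting, the weighted recall is always equal to the overall accuracy, which is consistent with the reported recall of 0.84 and accuracy of 84.44%.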
This study developed an occupation classification system based on DistilKoBERT, which demonstrated reasonable performance. Although further efforts are needed to enhance its classification accuracy, this automated occupation classification model holds promise for advancing epidemiological studies in the fields of occupational safety and health.