Department of Computer Science, National Centre for Text Mining, University of Manchester, Manchester, United Kingdom.
Centre for Occupational and Environmental Health, School of Health Sciences, University of Manchester, Manchester, United Kingdom.
PLoS One. 2024 Aug 15;19(8):e0307844. doi: 10.1371/journal.pone.0307844. eCollection 2024.
An individual's likelihood of developing non-communicable diseases is often influenced by the types, intensities and durations of exposures at work. Job exposure matrices provide exposure estimates associated with different occupations. However, due to their time-consuming expert curation process, job exposure matrices currently cover only a subset of possible workplace exposures and may not be regularly updated. Scientific articles describing exposure studies provide important supporting evidence for developing and updating job exposure matrices, since they report on exposures in a variety of occupational scenarios. However, the constant growth of the scientific literature is making it increasingly difficult to identify relevant articles, and the important content within them, efficiently. Natural language processing methods emulate the human process of reading and understanding texts, but in a fraction of the time. Such methods can increase the efficiency of both finding relevant documents and pinpointing specific information within them, which could streamline the process of developing and updating job exposure matrices. Named entity recognition is a fundamental natural language processing method for language understanding, which automatically identifies mentions of domain-specific concepts (named entities) in documents, e.g., exposures, occupations and job tasks. State-of-the-art machine learning models typically use evidence from an annotated corpus, i.e., a set of documents in which named entities are manually marked up (annotated) by experts, to learn how to detect named entities automatically in new documents. We have developed a novel annotated corpus of scientific articles to support machine learning-based named entity recognition relevant to occupational substance exposures.
Through incremental refinements to the annotation process, we demonstrate that expert annotators can attain high levels of agreement, and that the corpus can be used to train high-performance named entity recognition models. The corpus thus constitutes an important foundation for the wider development of natural language processing tools to support the study of occupational exposures.
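To make the ideas of an annotated corpus and annotator agreement concrete, the following is a minimal sketch and not the authors' method. It assumes entities are stored as character-offset spans with a label, and it measures agreement between two annotators as exact-match F1, a common choice for named entity annotation; the abstract does not state which agreement measure was used. The label names, the example sentence, and the helper names `Entity`, `span` and `pairwise_f1` are all hypothetical.

```python
from typing import NamedTuple, Set


class Entity(NamedTuple):
    start: int  # character offset of the first character (inclusive)
    end: int    # character offset just past the last character (exclusive)
    label: str  # hypothetical label names, e.g. "EXPOSURE", "OCCUPATION"


def span(text: str, phrase: str, label: str) -> Entity:
    """Build an Entity for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    return Entity(start, start + len(phrase), label)


def pairwise_f1(a: Set[Entity], b: Set[Entity]) -> float:
    """Exact-match F1 between two annotators, treating `a` as the reference.

    Two annotations match only if start, end and label are all identical.
    """
    if not a and not b:
        return 1.0  # both annotators marked nothing: full agreement
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence and annotations, for illustration only.
text = "Welders were exposed to manganese fumes while cutting steel."

annotator_a = {
    span(text, "Welders", "OCCUPATION"),
    span(text, "manganese fumes", "EXPOSURE"),
    span(text, "cutting steel", "JOB_TASK"),
}
annotator_b = {
    span(text, "Welders", "OCCUPATION"),
    span(text, "manganese", "EXPOSURE"),  # shorter span: no exact match
    span(text, "cutting steel", "JOB_TASK"),
}

agreement = pairwise_f1(annotator_a, annotator_b)
print(f"Exact-match F1 agreement: {agreement:.3f}")
```

In this toy case the two annotators agree on two of three entities, so the disagreement over the boundary of the exposure mention lowers the score. Boundary disagreements of this kind are what refinements to annotation guidelines typically aim to reduce.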