Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark.
Faculty of Information Technology and Computer Engineering, Azarbaijan Shahid Madani University, Tabriz, Iran.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae613.
Although lifestyle factors (LSFs) are increasingly acknowledged to shape individual health trajectories, particularly in chronic diseases, they have not yet been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists that can comprehensively detect all types of LSFs in text. The task is challenging because of the inherent diversity of LSFs, the lack of a comprehensive LSF classification for dictionary-based NER, and the lack of a corpus for deep learning-based NER.
We present a novel lifestyle factor ontology (LSFO), which we used to develop a dictionary-based system for recognition and normalization of LSFs. Additionally, we introduce a manually annotated corpus for LSFs (LSF200) suitable for training and evaluation of NER systems, and use it to train a transformer-based system. Evaluated on the corpus, the dictionary-based system achieved an F-score of 64% and the transformer-based system an F-score of 76%. Large-scale application of these systems to PubMed abstracts and PMC Open Access articles identified over 300 million mentions of LSFs in the biomedical literature.
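To make the two ideas in the preceding paragraph concrete, the following is a minimal sketch of dictionary-based recognition with normalization, followed by an F-score computation. It is not the authors' implementation: the dictionary entries, the `TOY:` identifiers, and the function names `tag` and `f_score` are hypothetical, and exact span-and-identifier matching is assumed as the evaluation criterion because the abstract does not state one.

```python
import re

# Hypothetical toy dictionary mapping surface forms to made-up identifiers.
# Neither the names nor the IDs are taken from LSFO.
TOY_DICTIONARY = {
    "physical activity": "TOY:0001",
    "exercise": "TOY:0001",  # synonym normalized to the same identifier
    "smoking": "TOY:0002",
    "alcohol consumption": "TOY:0003",
}


def tag(text, dictionary):
    """Return (start, end, identifier) for case-insensitive dictionary matches."""
    # Longer names come first so they win over shorter names at the same offset.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def f_score(predicted, gold):
    """Harmonic mean of precision and recall under exact matching (an assumption)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "Smoking and regular exercise affect health."
    for start, end, identifier in tag(sentence, TOY_DICTIONARY):
        print(sentence[start:end], identifier)
```

Normalization here is simply the lookup from a matched name to its identifier, so synonyms such as "exercise" and "physical activity" resolve to the same entry.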
LSFO, the annotated LSF200 corpus, and the LSFs detected in PubMed and PMC-OA articles by both NER systems are available under open licenses via the following GitHub repository: https://github.com/EsmaeilNourani/LSFO-expansion. This repository contains links to two associated GitHub repositories and a Zenodo project related to the study. LSFO is also available at BioPortal: https://bioportal.bioontology.org/ontologies/LSFO.