Nourani Esmaeil, Makri Evangelia-Mantelena, Mao Xiqing, Pyysalo Sampo, Brunak Søren, Nastou Katerina, Jensen Lars Juhl
Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark.
Faculty of Information Technology and Computer Engineering, Azarbaijan Shahid Madani University, Tabriz, Iran.
Database (Oxford). 2025 Jan 13;2025. doi: 10.1093/database/baae129.
Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, methods for extracting relations between LSFs and diseases from the literature are lacking, even though this step is necessary to consolidate the currently available knowledge into a structured form. Because simple co-occurrence-based relation extraction (RE) approaches cannot distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract these relations and classify them into specific relation types. However, no comprehensive LSF-disease RE system has existed, nor has a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 disease entities and 6930 LSF entities. We evaluated LSD600's quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by applying the trained model to two datasets, Nutrition-Disease and FoodDisease, where it achieved F-scores of 70.7% and 80.7%, respectively. Given these results, LSD600 and the RE system trained on it can be valuable resources for filling the existing gap in this area and can pave the way for downstream applications. Database URL: https://zenodo.org/records/13952449.