DiMB-RE: mining the scientific literature for diet-microbiome associations.

Authors

Hong Gibong, Hindle Veronica, Veasley Nadine M, Holscher Hannah D, Kilicoglu Halil

Affiliations

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States.

Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Publication information

J Am Med Inform Assoc. 2025 Jun 1;32(6):998-1006. doi: 10.1093/jamia/ocaf054.

Abstract

OBJECTIVES

To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies.

MATERIALS AND METHODS

We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (eg, Nutrient, Microorganism) and 13 relation types (eg, increases, improves) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked 2 generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings.

RESULTS

DiMB-RE consists of 14 450 entities and 4206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models.
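The F1 scores above are the harmonic mean of precision and recall. As a minimal sketch of how such scores can be computed, the example below assumes strict matching: a predicted entity or relation counts as correct only if its span(s) and type exactly equal a gold annotation. The abstract does not state the matching criteria the authors used, so this is an assumption; the function name `micro_f1`, the tuple layouts, and the toy annotations are hypothetical and are not taken from the DiMB-RE repository.

```python
from typing import Dict, Hashable, Iterable


def micro_f1(gold: Iterable[Hashable], predicted: Iterable[Hashable]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 under exact (strict) matching.

    Assumption: an item is a true positive only if an identical tuple
    appears in both the gold and the predicted annotations.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data. Entities: (doc_id, start_offset, end_offset, entity_type).
gold_entities = [
    ("doc1", 0, 5, "Nutrient"),
    ("doc1", 16, 29, "Microorganism"),
    ("doc1", 40, 48, "Microorganism"),
]
pred_entities = [
    ("doc1", 0, 5, "Nutrient"),
    ("doc1", 16, 29, "Microorganism"),
    ("doc1", 50, 55, "Nutrient"),  # spurious prediction
]

# Relations: (doc_id, head_span, tail_span, relation_type).
gold_relations = [
    ("doc1", (0, 5), (16, 29), "increases"),
    ("doc1", (0, 5), (40, 48), "increases"),
]
pred_relations = [
    ("doc1", (0, 5), (16, 29), "increases"),  # second gold relation is missed
]

entity_scores = micro_f1(gold_entities, pred_entities)
relation_scores = micro_f1(gold_relations, pred_relations)
print(entity_scores)
print(relation_scores)
```

In an end-to-end setting, relation tuples are built from predicted entity spans, so any entity error propagates into the relation score. This is one plausible reason relation extraction F1 sits well below entity recognition F1.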

DISCUSSION

To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. Natural language processing models fine-tuned on DiMB-RE perform worse than models reported for similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors.

CONCLUSION

DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
