DiMB-RE: mining the scientific literature for diet-microbiome associations.

Author information

Hong Gibong, Hindle Veronica, Veasley Nadine M, Holscher Hannah D, Kilicoglu Halil

Affiliations

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States.

Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Publication information

J Am Med Inform Assoc. 2025 Jun 1;32(6):998-1006. doi: 10.1093/jamia/ocaf054.

DOI:10.1093/jamia/ocaf054

PMID:40152137

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12089768/

Abstract

OBJECTIVES

To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies.

MATERIALS AND METHODS

We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (eg, Nutrient, Microorganism) and 13 relation types (eg, increases, improves) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked 2 generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings.

RESULTS

DiMB-RE consists of 14 450 entities and 4206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models.

DISCUSSION

To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. Natural language processing models fine-tuned on DiMB-RE exhibit lower performance than models trained on similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors.

CONCLUSION

DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
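The F1 scores reported in the Results (0.800 for named entity recognition, 0.445 for end-to-end relation extraction) combine precision and recall into one number. As a minimal sketch of how such a score can be computed, the Python example below uses exact matching over relation tuples. The matching criterion, the function name `precision_recall_f1`, and the example entities are assumptions made for illustration; they are not taken from the DiMB-RE evaluation code, which may apply different matching rules.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    gold: Iterable[Hashable], predicted: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Return precision, recall, and F1 under exact matching of items."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical relation tuples: (subject entity, relation type, object entity).
# The entity strings are invented for illustration only.
gold_relations = [
    ("inulin", "increases", "Bifidobacterium"),
    ("dietary fiber", "improves", "gut barrier function"),
    ("resistant starch", "increases", "butyrate"),
]
predicted_relations = [
    ("inulin", "increases", "Bifidobacterium"),            # exact match
    ("dietary fiber", "increases", "gut barrier function"),  # wrong relation type
]

p, r, f1 = precision_recall_f1(gold_relations, predicted_relations)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# prints: precision=0.500 recall=0.333 F1=0.400
```

In this sketch a predicted relation counts as correct only if both entities and the relation type match a gold tuple, so an error in any upstream step (entity recognition or relation typing) lowers the end-to-end score. This is one reason end-to-end relation extraction F1 tends to fall below entity recognition F1.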
