DiMB-RE: mining the scientific literature for diet-microbiome associations.

Authors

Hong Gibong, Hindle Veronica, Veasley Nadine M, Holscher Hannah D, Kilicoglu Halil

Affiliations

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States.

Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Publication information

J Am Med Inform Assoc. 2025 Jun 1;32(6):998-1006. doi: 10.1093/jamia/ocaf054.

DOI: 10.1093/jamia/ocaf054
PMID: 40152137
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12089768/

Abstract

OBJECTIVES

To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies.

MATERIALS AND METHODS

We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (eg, Nutrient, Microorganism) and 13 relation types (eg, increases, improves) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked 2 generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings.
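
The annotation layers named above (entities, triggers, relations, factuality) can be pictured as a few small record types. The following is a minimal sketch for illustration only: the class names, field names, example sentence, and the factuality value "Factual" are assumptions, not the DiMB-RE file format. Only the type names Nutrient, Microorganism, and increases are taken from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """A character span in a sentence (end offset is exclusive)."""
    start: int
    end: int
    text: str


@dataclass(frozen=True)
class Entity:
    span: Span
    type: str  # e.g. "Nutrient" or "Microorganism"


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    type: str                         # e.g. "increases"
    trigger: Optional[Span] = None    # word or phrase that signals the relation
    factuality: Optional[str] = None  # hypothetical label set, not from the paper


def make_span(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase and return its span."""
    start = sentence.index(phrase)
    return Span(start, start + len(phrase), phrase)


# Invented example sentence, not drawn from the corpus.
sentence = "Inulin intake increased Bifidobacterium abundance."
nutrient = Entity(make_span(sentence, "Inulin"), "Nutrient")
microbe = Entity(make_span(sentence, "Bifidobacterium"), "Microorganism")
relation = Relation(
    head=nutrient,
    tail=microbe,
    type="increases",
    trigger=make_span(sentence, "increased"),
    factuality="Factual",
)
print(relation.head.span.text, relation.type, relation.tail.span.text)
```

Keeping the trigger and factuality fields optional mirrors the fact that these are separate extraction tasks in the abstract; a pipeline could fill them in at different stages.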

RESULTS

DiMB-RE consists of 14 450 entities and 4206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models.
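
For readers unfamiliar with the F1 scores reported above, the sketch below shows how precision, recall, and F1 are commonly computed for relation extraction when each relation is treated as a (head, type, tail) tuple and only exact matches count. The matching criterion and the toy tuples are assumptions for illustration; the abstract does not state the exact evaluation protocol used for DiMB-RE.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted relation tuples against gold tuples by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for this example (not from the corpus).
gold = {
    ("inulin", "increases", "Bifidobacterium"),
    ("fiber", "increases", "Lactobacillus"),
    ("fiber", "improves", "gut barrier function"),
    ("polyphenols", "increases", "Akkermansia"),
}
predicted = {
    ("inulin", "increases", "Bifidobacterium"),
    ("fiber", "improves", "gut barrier function"),
    ("fiber", "increases", "Bacteroides"),  # wrong tail entity: no credit
}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under exact matching, a single wrong entity in a tuple costs both a false positive and a false negative, which is one reason end-to-end relation extraction scores tend to sit well below entity recognition scores.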

DISCUSSION

To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. Natural language processing models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors.

CONCLUSION

DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63ce/12089768/ad20549210ef/ocaf054f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63ce/12089768/9a150c26e440/ocaf054f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63ce/12089768/0966d8bf7da9/ocaf054f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63ce/12089768/d53ee0441567/ocaf054f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63ce/12089768/bcf2e23cd9cd/ocaf054f5.jpg
