RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature.

Affiliations

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark.

TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland.

Publication information

Database (Oxford). 2024 Sep 12;2024. doi: 10.1093/database/baae095.

Abstract

In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/861b/11394941/d95f2f9c2c6e/baae095f1.jpg
