Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts.

Author information

Nicholson David N, Himmelstein Daniel S, Greene Casey S

Affiliations

Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.

Department of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, USA.

Publication information

BioData Min. 2022 Oct 18;15(1):26. doi: 10.1186/s13040-022-00311-z.

DOI:10.1186/s13040-022-00311-z

PMID:36258252

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9578183/

Abstract

BACKGROUND

Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types.
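To make the idea of a label function concrete, the following is a minimal sketch, not the authors' implementation. It assumes a candidate is a dictionary holding a compound, a gene, and the sentence that co-mentions them. The tiny `KNOWN_BINDS` set, the function names, and the cue words are all hypothetical. A plain majority vote stands in for the learned label model that data programming frameworks normally use to combine label function outputs.

```python
import re

# Label values follow a common data-programming convention (assumed here).
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Hypothetical toy stand-in for a database of known Compound-binds-Gene edges.
KNOWN_BINDS = {("imatinib", "ABL1"), ("aspirin", "PTGS2")}


def lf_in_database(candidate):
    """Database label function: vote positive if the pair is a known edge."""
    pair = (candidate["compound"], candidate["gene"])
    return POSITIVE if pair in KNOWN_BINDS else ABSTAIN


def lf_binding_keyword(candidate):
    """Text label function: vote positive if a binding cue word appears."""
    if re.search(r"\b(binds?|bound|ligand)\b", candidate["sentence"], re.I):
        return POSITIVE
    return ABSTAIN


def lf_negation(candidate):
    """Text label function: vote negative if the sentence is negated."""
    if re.search(r"\b(not|no|failed to)\b", candidate["sentence"], re.I):
        return NEGATIVE
    return ABSTAIN


LABEL_FUNCTIONS = [lf_in_database, lf_binding_keyword, lf_negation]


def majority_vote(candidate, label_functions=LABEL_FUNCTIONS):
    """Combine votes by simple majority; ties and all-abstain give ABSTAIN."""
    votes = [lf(candidate) for lf in label_functions]
    votes = [v for v in votes if v != ABSTAIN]
    positives = sum(v == POSITIVE for v in votes)
    negatives = len(votes) - positives
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE
```

Under this sketch, re-using label functions across edge types would amount to applying text-based functions such as `lf_binding_keyword` to candidates of a different edge type, for example gene-gene sentence pairs.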

RESULTS

We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1.
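The recall estimate described above can be illustrated with a short sketch. It rests on several assumptions that the abstract does not confirm: sentence-level discriminator scores are aggregated per entity pair by taking the maximum, a fixed threshold decides whether a pair counts as predicted, and the function name `edge_recall` and its parameters are hypothetical.

```python
from collections import defaultdict


def edge_recall(sentence_scores, known_edges, threshold=0.5):
    """Return (recall over known edges, set of predicted pairs not yet known).

    sentence_scores: iterable of ((source, target), score) per sentence.
    known_edges: set of (source, target) pairs already in the graph.
    """
    # Keep the highest score seen for each entity pair (assumed aggregation).
    best = defaultdict(float)
    for pair, score in sentence_scores:
        best[pair] = max(best[pair], score)

    predicted = {pair for pair, score in best.items() if score >= threshold}
    known = set(known_edges)
    recalled = predicted & known   # established edges recovered from text
    novel = predicted - known      # candidate edges absent from the graph
    recall = len(recalled) / len(known) if known else 0.0
    return recall, novel
```

In this framing, pairs in `novel` correspond to candidate edges that could be added to the source knowledge graph, while `recall` measures how many established edges the text-derived predictions recover.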

CONCLUSIONS

Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
