A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction.

Affiliation

Evotec (UK) Ltd., in silico Research and Development, Milton Park, Abingdon, Oxfordshire, United Kingdom.

Publication information

PLoS One. 2023 Sep 8;18(9):e0291142. doi: 10.1371/journal.pone.0291142. eCollection 2023.

DOI:10.1371/journal.pone.0291142
PMID:37682956
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10490933/
Abstract

One area of active research is the use of natural language processing (NLP) to mine biomedical texts for sets of triples (subject-predicate-object) for knowledge graph (KG) construction. While statistical methods to mine co-occurrences of entities within sentences are relatively robust, accurate relationship extraction is more challenging. Herein, we evaluate the Global Network of Biomedical Relationships (GNBR), a dataset that uses distributional semantics to model relationships between biomedical entities. The focus of our paper is an evaluation of a subset of the GNBR data: the relationships between chemicals and genes/proteins. We use Evotec's structured 'Nexus' database of >2.76M chemical-protein interactions as a ground truth to compare with GNBR's relationships. When the comparison is made on a sentence-by-sentence basis, we find a micro-averaged precision-recall area under the curve (AUC) of 0.50 and a micro-averaged receiver operating characteristic (ROC) curve AUC of 0.71 across the relationship classes 'inhibits', 'binding', 'agonism' and 'antagonism'. We conclude that, even though these micro-averaged scores are modest, using a high threshold on certain relationship classes like 'inhibits' could yield high-fidelity triples that are not reported in structured datasets. We discuss how different methods of processing GNBR data, and the factuality of triples, could affect the accuracy of NLP data incorporated into knowledge graphs. We provide a GNBR-Nexus (ChEMBL-subset) merged datafile that contains over 20,000 sentences in which a protein/gene and a chemical co-occur, and that includes both the GNBR relationship scores and the ChEMBL (manually curated) relationships (e.g., 'agonist', 'inhibitor'); it can be accessed at https://doi.org/10.5281/zenodo.8136752. We envisage this being used to aid curation efforts by the drug discovery community.
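
The micro-averaged scores and the per-class thresholding mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the authors computed their AUCs, so this is only one common way to do it: pool the (score, label) pairs from every relationship class, then compute a rank-based ROC AUC and a step-wise precision-recall area (average precision). The toy data, the class names used as dictionary keys, and all function names are hypothetical and are not taken from GNBR, Nexus, or the paper's code.

```python
from typing import Dict, List, Tuple

# One (score, label) pair per sentence: label is 1 if the ground truth
# confirms the relationship, 0 otherwise. Toy values, not real data.
Scored = List[Tuple[float, int]]

toy: Dict[str, Scored] = {
    "inhibits": [(0.9, 1), (0.8, 1), (0.6, 0), (0.3, 1), (0.2, 0)],
    "binding": [(0.7, 0), (0.5, 1), (0.4, 0), (0.1, 0)],
}


def micro_pool(per_class: Dict[str, Scored]) -> Scored:
    """Pool pairs from all classes so each sentence counts equally."""
    return [pair for pairs in per_class.values() for pair in pairs]


def roc_auc(pairs: Scored) -> float:
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(pairs: Scored) -> float:
    """Step-wise area under the precision-recall curve.

    Tied scores are ranked in input order here; the toy data has no ties.
    """
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    total_pos = sum(y for _, y in ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive")
    hits, area = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            area += hits / rank  # precision at this recall step
    return area / total_pos


def precision_at_threshold(pairs: Scored, threshold: float) -> float:
    """Precision among the pairs whose score is at or above the threshold."""
    kept = [y for s, y in pairs if s >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")


pooled = micro_pool(toy)
print("micro ROC AUC:", roc_auc(pooled))
print("micro PR AUC :", average_precision(pooled))
print("'inhibits' precision at 0.8:", precision_at_threshold(toy["inhibits"], 0.8))
```

Pooling before scoring is what makes the averages "micro": classes with more sentences contribute more to the result. The last function mirrors the idea of keeping only high-scoring triples from a single class, at the cost of discarding every sentence below the chosen cut-off.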

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ba4/10490933/cb086f7f9a82/pone.0291142.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ba4/10490933/7c7490bded67/pone.0291142.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ba4/10490933/d783848d0e62/pone.0291142.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ba4/10490933/43b3334ce950/pone.0291142.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ba4/10490933/24d07671e907/pone.0291142.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ba4/10490933/491e797d6b6a/pone.0291142.g006.jpg
