Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature.

Author information

Arguello Casteleiro Mercedes, Demetriou George, Read Warren, Fernandez Prieto Maria Jesus, Maroto Nava, Maseda Fernandez Diego, Nenadic Goran, Klein Julie, Keane John, Stevens Robert

Affiliations

School of Computer Science, University of Manchester, Manchester, UK.

Salford Languages, University of Salford, Salford, UK.

Publication information

J Biomed Semantics. 2018 Apr 12;9(1):13. doi: 10.1186/s13326-018-0181-1.

DOI: 10.1186/s13326-018-0181-1

PMID: 29650041

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5896136/

Abstract

BACKGROUND

Automatic identification of term variants or acceptable alternative free-text terms for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) if word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) if biological knowledge from the CVDO can improve such a list without modifying the word embeddings created.

METHODS

We manually annotated 105 gene/protein names from 25 PubMed titles/abstracts and mapped them to 79 unique UniProtKB entries corresponding to gene and protein classes from the CVDO. Using more than 14 million PubMed articles (titles and available abstracts), word embeddings were generated with CBOW and Skip-gram. We set up two experiments for a synonym detection task, each with four raters and 3672 pairs of terms (target term and candidate term) from the word embeddings created. For Experiment I, the target terms for 64 UniProtKB entries were those that appear in the titles/abstracts; Experiment II involves 63 UniProtKB entries, and the target terms are a combination of terms from PubMed titles/abstracts with terms (i.e. increased context) from the CVDO protein class expressions and labels.
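The synonym detection task pairs each target term with candidate terms taken from the word embeddings. As a minimal sketch, assuming candidates are ranked by cosine similarity to the target (the abstract does not name the ranking measure), the pairing could look like the following. The toy three-dimensional vectors, the placeholder vocabulary, and the function names `cosine` and `candidate_pairs` are hypothetical and are not taken from the study.

```python
import math

# Toy embeddings: hypothetical 3-d vectors for placeholder terms,
# NOT values or vocabulary from the study.
EMBEDDINGS = {
    "gene_a": [0.9, 0.1, 0.0],
    "gene_a_alias": [0.8, 0.2, 0.1],
    "protein_a": [0.7, 0.3, 0.0],
    "unrelated_term": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(target, embeddings, top_n=2):
    """Return (target, candidate) pairs for the top_n nearest terms."""
    target_vec = embeddings[target]
    scored = [
        (term, cosine(target_vec, vec))
        for term, vec in embeddings.items()
        if term != target  # a term is never its own candidate
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(target, term) for term, _ in scored[:top_n]]


pairs = candidate_pairs("gene_a", EMBEDDINGS)
print(pairs)
```

In a setup like this, each resulting pair would then go to human raters, who judge whether the candidate is a full or partial term variant of the target.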

RESULTS

In Experiment I, Skip-gram finds term variants (full and/or partial) for 89% of the 64 UniProtKB entries, while CBOW finds term variants for 67%. In Experiment II (with the aid of the CVDO), Skip-gram finds term variants for 95% of the 63 UniProtKB entries, while CBOW finds term variants for 78%. Combining the results of both experiments, Skip-gram finds term variants for 97% of the 79 UniProtKB entries, while CBOW finds term variants for 81%.
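The percentages above are shares of UniProtKB entries for which at least one term variant was found. The sketch below shows how such a share could be computed, and back-calculates the approximate entry counts implied by the reported figures. The counts are derived here by rounding and are not stated in the abstract; the helper names and the toy entry identifiers are hypothetical.

```python
def coverage_percent(found_by_entry):
    """Percentage of entries with at least one term variant found."""
    hits = sum(1 for variants in found_by_entry.values() if variants)
    return 100.0 * hits / len(found_by_entry)


def implied_count(percent, total):
    """Approximate number of entries behind a rounded percentage."""
    return round(percent * total / 100)


# Toy data with hypothetical entry IDs and variants (not from the study).
toy = {
    "ENTRY_1": ["alias_1"],
    "ENTRY_2": [],
    "ENTRY_3": ["alias_3a", "alias_3b"],
    "ENTRY_4": ["alias_4"],
}
toy_coverage = coverage_percent(toy)

# (percent, number of UniProtKB entries) as reported in the abstract.
REPORTED = {
    ("Experiment I", "Skip-gram"): (89, 64),
    ("Experiment I", "CBOW"): (67, 64),
    ("Experiment II", "Skip-gram"): (95, 63),
    ("Experiment II", "CBOW"): (78, 63),
    ("Combined", "Skip-gram"): (97, 79),
    ("Combined", "CBOW"): (81, 79),
}
implied = {key: implied_count(p, n) for key, (p, n) in REPORTED.items()}

for (experiment, model), count in implied.items():
    total = REPORTED[(experiment, model)][1]
    print(f"{experiment}, {model}: about {count} of {total} entries")
```

Read this way, the gap between the two models in Experiment I would amount to roughly 14 of the 64 entries, under the rounding assumption above.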

CONCLUSIONS

This study shows performance improvements for both CBOW and Skip-gram on a gene/protein synonym detection task by adding knowledge formalised in the CVDO and without modifying the word embeddings created. Hence, the CVDO supplies context that is effective in inducing term variability for both CBOW and Skip-gram while reducing ambiguity. Skip-gram outperforms CBOW and finds more pertinent term variants for gene/protein names annotated from the scientific literature.
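One way to picture how ontology context can change the candidate list while the embeddings stay fixed is to enrich the query rather than the vectors. The sketch below assumes that the vectors of several target terms are averaged into a single query vector; the abstract does not say how terms are combined, so this is an illustrative assumption. The toy vectors, the placeholder terms, and the function names are hypothetical.

```python
import math

# Toy embeddings (hypothetical, fixed throughout): an ambiguous symbol,
# an ontology-derived label, a gene term variant, and an unrelated sense.
EMBEDDINGS = {
    "sym": [0.4, 0.6, 0.0],
    "ontology_label": [1.0, 0.0, 0.0],
    "gene_variant": [0.9, 0.1, 0.0],
    "other_sense": [0.1, 0.9, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query_vector(terms, embeddings):
    """Component-wise mean of the vectors of the target terms."""
    vectors = [embeddings[t] for t in terms]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def rank_candidates(terms, embeddings):
    """Rank all non-query terms by similarity to the combined query."""
    query = query_vector(terms, embeddings)
    scored = [
        (term, cosine(query, vec))
        for term, vec in embeddings.items()
        if term not in terms
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [term for term, _ in scored]


without_context = rank_candidates(["sym"], EMBEDDINGS)
with_context = rank_candidates(["sym", "ontology_label"], EMBEDDINGS)
print(without_context[0], with_context[0])
```

In this toy setting the ambiguous symbol alone ranks the unrelated sense first, while adding the ontology label moves the gene term variant to the top, without any change to the stored vectors.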

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6619/5896136/01450cb23103/13326_2018_181_Fig1_HTML.jpg
