A semantic similarity based methodology for predicting protein-protein interactions: Evaluation with P53-interacting kinases.

Affiliations

Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.

The Laboratory for Molecular Informatics and Data Sciences, Department of Pharmaceutical Sciences and the BRITE Institute, College of Health and Sciences, North Carolina Central University, Durham, NC 27707, USA.

Publication information

J Biomed Inform. 2020 Nov;111:103579. doi: 10.1016/j.jbi.2020.103579. Epub 2020 Sep 30.

DOI:10.1016/j.jbi.2020.103579

PMID:33007449

Abstract

Biomedical literature contains rich, unstructured information about proteins, ligands, diseases, and the biological pathways in which they are involved. Systematically analyzing such a textual corpus has the potential for biomedical discovery of new protein-protein interactions and hidden drug indications. For this purpose, we have investigated a methodology that is based on a well-established text mining tool, Word2Vec, for the analysis of PubMed full-text articles to derive word embeddings, and the use of a simple semantic similarity comparison, either by itself or in conjunction with the k-Nearest Neighbor (kNN) technique, for the prediction of new relationships. To test this methodology, three lines of retrospective analyses of a dataset with known P53-interacting proteins have been conducted. First, we demonstrated that Word2Vec semantic similarity can infer functional relatedness among all kinases known to interact with P53. Second, in a series of time-split experiments, we demonstrated that both a simple similarity comparison and kNN models built with papers published up to a certain year were able to discover P53 interactors described in later publications. Third, in a different scenario of time-split experiments, we examined the predictions of P53-interacting proteins made by kNN models built on data prior to a certain split year, for different time ranges past that year, and found that the cumulative number of correct predictions increased with time. We conclude that text mining of research papers in the PubMed literature based on Word2Vec analysis followed by a simple similarity comparison or kNN modeling affords excellent predictions of protein-protein interactions between P53 and kinases, and should have wide applications in translational biomedical studies such as repurposing of existing drugs, drug-drug interaction, and elucidation of mechanisms of action for drugs.
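
The two prediction modes named in the abstract (a plain similarity comparison, and kNN over word embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the vectors, the placeholder protein names, the labels, the helper function names, and the choice of cosine similarity with a majority vote are all assumptions made for illustration. In a time-split setting, the embeddings and labels would come only from papers published up to the split year, and the predictions would be checked against later publications.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, embeddings, candidates):
    """Rank candidate terms by cosine similarity to the query term."""
    q = embeddings[query]
    scored = [(name, cosine(q, embeddings[name])) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def knn_predict(candidate, embeddings, labeled, k=3):
    """Majority vote among the k labeled terms most similar to the candidate."""
    vec = embeddings[candidate]
    neighbors = sorted(
        labeled, key=lambda name: cosine(vec, embeddings[name]), reverse=True
    )[:k]
    votes = sum(labeled[name] for name in neighbors)
    return votes * 2 > k

# Hypothetical 3-dimensional vectors standing in for word embeddings
# trained on pre-split-year text. Real embeddings have far more dimensions.
EMBEDDINGS = {
    "p53": [0.9, 0.1, 0.0],
    "kinase_a": [0.8, 0.2, 0.1],
    "kinase_b": [0.7, 0.3, 0.0],
    "kinase_c": [0.85, 0.1, 0.05],
    "kinase_d": [0.1, 0.9, 0.2],
    "kinase_e": [0.0, 0.8, 0.6],
    "kinase_x": [0.75, 0.25, 0.05],
    "kinase_y": [0.05, 0.7, 0.7],
}

# Hypothetical labels known before the split year: 1 = interactor, 0 = not.
LABELED = {"kinase_a": 1, "kinase_b": 1, "kinase_c": 1, "kinase_d": 0, "kinase_e": 0}

if __name__ == "__main__":
    # Mode 1: rank unlabeled candidates directly by similarity to "p53".
    for name, score in rank_by_similarity("p53", EMBEDDINGS, ["kinase_x", "kinase_y"]):
        print(f"{name}: {score:.3f}")
    # Mode 2: classify each candidate by its nearest labeled neighbors.
    for name in ("kinase_x", "kinase_y"):
        print(name, knn_predict(name, EMBEDDINGS, LABELED, k=3))
```

In this toy data, the candidate whose vector lies close to the known interactors is ranked first and receives a positive kNN vote, while the distant candidate does not.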

Similar articles

BMC Med Inform Decis Mak. 2017 Jul 3;17(1):95. doi: 10.1186/s12911-017-0498-1.

Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.

Corpus domain effects on distributional semantic modeling of medical terms.

Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.

Large scale biomedical texts classification: a kNN and an ESA-based approaches.

J Biomed Semantics. 2016 Jun 16;7:40. doi: 10.1186/s13326-016-0073-1.

In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access.

J Biomed Inform. 2015 Oct;57:204-18. doi: 10.1016/j.jbi.2015.07.015. Epub 2015 Aug 1.

Literature-Wide Association Studies (LWAS) for a Rare Disease: Drug Repurposing for Inflammatory Breast Cancer.

Molecules. 2020 Aug 28;25(17):3933. doi: 10.3390/molecules25173933.

Cited by

Pre-trained protein language model sheds new light on the prediction of Arabidopsis protein-protein interactions.

Plant Methods. 2023 Dec 7;19(1):141. doi: 10.1186/s13007-023-01119-6.
