Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

Affiliations

Department of Medical Bioinformatics, University Medical Center, Göttingen, Lower Saxony, Germany.

geneXplain GmbH, Wolfenbüttel, Lower Saxony, Germany.

Publication information

PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.

DOI:10.1371/journal.pone.0258623

PMID:34653224

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8519453/

Abstract

Biomedical and life science literature is an essential channel for publishing experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, helping scientists discover new relationships between biological entities and answer biological questions. Using the word2vec approach, we generated word vector representations from a corpus of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the substitution of synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks. This performance was sufficient to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms such as word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
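
The abstract mentions two steps that a small example can make concrete: replacing synonyms with preferred terms before training, and deriving a gene-gene network from embedding similarity. The following Python sketch illustrates both ideas under stated assumptions and is not the authors' pipeline. The synonym table `PREFERRED`, the toy three-dimensional vectors, and the similarity threshold of 0.8 are all invented for illustration; the abstract does not specify how edges were selected.

```python
import math
from itertools import combinations

# Hypothetical synonym table (an assumption, not taken from the paper):
# maps lower-cased synonyms to a preferred gene symbol.
PREFERRED = {"her2": "ERBB2", "neu": "ERBB2", "p53": "TP53"}


def normalize(tokens):
    """Replace each token that is a known synonym with its preferred term."""
    return [PREFERRED.get(token.lower(), token) for token in tokens]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_network(vectors, threshold):
    """Return (term_a, term_b, similarity) for each pair at or above threshold."""
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges


# Toy embeddings invented for illustration; real word2vec vectors
# typically have hundreds of dimensions.
toy_vectors = {
    "ERBB2": [0.9, 0.1, 0.0],
    "EGFR": [0.8, 0.2, 0.1],
    "TP53": [0.0, 0.1, 0.9],
}

# The threshold is an assumed cut-off chosen only for this example.
edges = similarity_network(toy_vectors, threshold=0.8)
for a, b, sim in edges:
    print(f"{a} - {b}: {sim:.3f}")
```

With these toy vectors only the EGFR and ERBB2 pair exceeds the assumed threshold, so the resulting network has a single edge. In practice, the vectors would come from a trained embedding model and the cut-off would need to be chosen and validated against known interactions.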

Similar articles

BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.

Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.

BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.

BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies.

BMC Bioinformatics. 2019 Jan 7;20(1):10. doi: 10.1186/s12859-018-2584-5.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

FuseLinker: Leveraging LLM's pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs.

J Biomed Inform. 2024 Oct;158:104730. doi: 10.1016/j.jbi.2024.104730. Epub 2024 Sep 24.

Optimizing word embeddings for small dataset: a case study on patient portal messages from breast cancer patients.

Sci Rep. 2024 Jul 12;14(1):16117. doi: 10.1038/s41598-024-66319-z.

A hybrid model based on neural networks for biomedical relation extraction.

J Biomed Inform. 2018 May;81:83-92. doi: 10.1016/j.jbi.2018.03.011. Epub 2018 Mar 27.

Unsupervised and self-supervised deep learning approaches for biomedical text mining.

Brief Bioinform. 2021 Mar 22;22(2):1592-1603. doi: 10.1093/bib/bbab016.

Cited by

Artificial Intelligence in Biomedical Sciences: A Scoping Review.

Br J Biomed Sci. 2025 Aug 5;82:14362. doi: 10.3389/bjbs.2025.14362. eCollection 2025.

Cutting-edge AI tools revolutionizing scientific research in life sciences.

BioTechnologia (Pozn). 2025 Mar 31;106(1):77-102. doi: 10.5114/bta/200803. eCollection 2025.

Predicting drug-gene relations via analogy tasks with word embeddings.

Sci Rep. 2025 May 18;15(1):17240. doi: 10.1038/s41598-025-01418-z.

Evaluation of input data modality choices on functional gene embeddings.

NAR Genom Bioinform. 2023 Nov 2;5(4):lqad095. doi: 10.1093/nargab/lqad095. eCollection 2023 Dec.

Overview of methods for characterization and visualization of a protein-protein interaction network in a multi-omics integration context.

Front Mol Biosci. 2022 Sep 8;9:962799. doi: 10.3389/fmolb.2022.962799. eCollection 2022.

Large scale text mining for deriving useful insights: A case study focused on microbiome.

Front Physiol. 2022 Aug 31;13:933069. doi: 10.3389/fphys.2022.933069. eCollection 2022.

Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer.

NAR Genom Bioinform. 2021 Dec 8;3(4):lqab113. doi: 10.1093/nargab/lqab113. eCollection 2021 Dec.
