Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

Affiliations

Department of Medical Bioinformatics, University Medical Center, Göttingen, Lower Saxony, Germany.

geneXplain GmbH, Wolfenbüttel, Lower Saxony, Germany.

Publication information

PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.


DOI:10.1371/journal.pone.0258623
PMID:34653224
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8519453/
Abstract

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the substitution of synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train graph convolutional neural networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared with that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared with those trained on other networks. This performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms such as word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
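
The abstract's central idea, that related entities have word vectors with higher cosine similarity and that a gene-gene network can be derived from those similarities, can be illustrated with a minimal sketch. The toy three-dimensional vectors, the `build_gene_network` helper, and the fixed similarity threshold are all assumptions made for illustration; the abstract does not state how the authors selected edges, and real word2vec embeddings typically have hundreds of dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def build_gene_network(embeddings, threshold):
    """Hypothetical helper: connect two genes when their vectors'
    cosine similarity is at least `threshold`.

    `embeddings` maps a gene symbol to its vector. The thresholding
    rule is an assumption, not the procedure used in the paper.
    """
    edges = []
    for gene_a, gene_b in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[gene_a], embeddings[gene_b])
        if sim >= threshold:
            edges.append((gene_a, gene_b, sim))
    return edges


# Made-up toy vectors, not taken from the published embeddings.
toy_embeddings = {
    "BRCA1": [0.9, 0.1, 0.0],
    "BRCA2": [0.8, 0.2, 0.1],
    "TP53": [0.1, 0.9, 0.2],
}

edges = build_gene_network(toy_embeddings, threshold=0.9)
for gene_a, gene_b, sim in edges:
    print(f"{gene_a} - {gene_b}: {sim:.3f}")
```

With these toy values only the BRCA1 and BRCA2 pair passes the threshold. An edge list of this kind is the sort of prior-knowledge graph that could then be supplied to a Graph-CNN in place of a PPI network.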

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df37/8519453/0bff5a1d1145/pone.0258623.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df37/8519453/294662a83afe/pone.0258623.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df37/8519453/fe3131288ba8/pone.0258623.g003.jpg

Similar articles

[1]
Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

PLoS One. 2021

[2]
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020-4-23

[3]
Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.

BMC Med Inform Decis Mak. 2018-7-23

[4]
Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

J Biomed Inform. 2019-2

[5]
BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies.

BMC Bioinformatics. 2019-1-7

[6]
A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018-9-12

[7]
FuseLinker: Leveraging LLM's pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs.

J Biomed Inform. 2024-10

[8]
Optimizing word embeddings for small dataset: a case study on patient portal messages from breast cancer patients.

Sci Rep. 2024-7-12

[9]
A hybrid model based on neural networks for biomedical relation extraction.

J Biomed Inform. 2018-3-27

[10]
Unsupervised and self-supervised deep learning approaches for biomedical text mining.

Brief Bioinform. 2021-3-22

Cited by

[1]
Artificial Intelligence in Biomedical Sciences: A Scoping Review.

Br J Biomed Sci. 2025-8-5

[2]
Cutting-edge AI tools revolutionizing scientific research in life sciences.

BioTechnologia (Pozn). 2025-3-31

[3]
Predicting drug-gene relations via analogy tasks with word embeddings.

Sci Rep. 2025-5-18

[4]
Evaluation of input data modality choices on functional gene embeddings.

NAR Genom Bioinform. 2023-11-2

[5]
Overview of methods for characterization and visualization of a protein-protein interaction network in a multi-omics integration context.

Front Mol Biosci. 2022-9-8

[6]
Large scale text mining for deriving useful insights: A case study focused on microbiome.

Front Physiol. 2022-8-31

[7]
Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer.

NAR Genom Bioinform. 2021-12-8

References

[1]
Explaining decisions of graph convolutional neural networks: patient-specific molecular subnetworks responsible for metastasis prediction in breast cancer.

Genome Med. 2021-3-11

[2]
Ensembl 2021.

Nucleic Acids Res. 2021-1-8

[3]
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020-4-23

[4]
The reactome pathway knowledgebase.

Nucleic Acids Res. 2020-1-8

[5]
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Bioinformatics. 2020-2-15

[6]
Utilizing Molecular Network Information via Graph Convolutional Neural Networks to Predict Metastatic Event in Breast Cancer.

Stud Health Technol Inform. 2019-9-3

[7]
BioWordVec, improving biomedical word embeddings with subword information and MeSH.

Sci Data. 2019-5-10

[8]
Gene2vec: distributed representation of genes based on co-expression.

BMC Genomics. 2019-2-4

[9]
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.

Nucleic Acids Res. 2019-1-8

[10]
Human Disease Ontology 2018 update: classification, content and workflow expansion.

Nucleic Acids Res. 2019-1-8
