Department of Medical Bioinformatics, University Medical Center, Göttingen, Lower Saxony, Germany.
geneXplain GmbH, Wolfenbüttel, Lower Saxony, Germany.
PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.
Biomedical and life science literature is an essential channel for publishing experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, to aid scientists in discovering new relationships between biological entities and in answering biological questions. Using the word2vec approach, we generated word vector representations from a corpus of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the substitution of synonymous terms by their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities, such as known PPIs, shared signaling pathways and cellular functions, or membership in narrower disease ontology groups, correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to those trained with other networks. This performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
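As a rough illustration of how a gene-gene network might be derived from word embeddings by cosine similarity, the following minimal sketch links two terms whenever the cosine similarity of their vectors reaches a cutoff. It is not the authors' pipeline: the term names, the three-dimensional toy vectors, the `threshold` value, and the helper functions `cosine_similarity_matrix` and `similarity_network` are all hypothetical and chosen only for demonstration. Real word2vec vectors would have many more dimensions and would be loaded from a trained model.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # scale every row to unit length
    return unit @ unit.T


def similarity_network(terms, vectors, threshold):
    """List edges (term_a, term_b, similarity) at or above the threshold."""
    sims = cosine_similarity_matrix(vectors)
    edges = []
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):  # each unordered pair once
            if sims[i, j] >= threshold:
                edges.append((terms[i], terms[j], float(sims[i, j])))
    return edges


# Toy embeddings (assumption): placeholders, not real word2vec output.
terms = ["GENE_A", "GENE_B", "GENE_C"]
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])

# The cutoff of 0.5 is an arbitrary illustrative choice.
edges = similarity_network(terms, vectors, threshold=0.5)
for a, b, s in edges:
    print(f"{a} -- {b} (cosine similarity {s:.2f})")
```

The resulting edge list could then serve as an adjacency structure for a graph-based model. How the cutoff is chosen, and whether edges are weighted, are design decisions that this sketch leaves open.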